Road scene analysis is a vital task for driving assistance systems. Robust detection of road users, including vehicles, pedestrians and animals, is a precondition for safe navigation. The complexity and variability of outdoor illumination conditions limit RGB-based detection algorithms, especially when they are faced with strong reflections, noise or bad weather. In contrast, polarization-encoded images are a non-conventional modality in which the light wave reflected at each pixel is strongly linked to the physical properties of the reflecting surface. The significant interest resides in the fact that polarimetric imaging is a rich modality that makes it possible to characterize an object by its reflective properties. In a polarimetric image, each pixel encodes information regarding the object's roughness, its orientation and its reflection [wolff1995polarization]. Applications of polarimetric imaging range from indoor autonomous navigation [berger2017depth], depth map estimation [Zhu_2019_CVPR] and 3D object reconstruction [morel2006active] to the differentiation of healthy and unhealthy cervical tissues in order to detect cancer at an early stage [rehbinder2016ex]. Recently, polarization imaging has also been exploited in autonomous driving applications, either to enhance car detection [fan2018polarization], road mapping and perception [aycock2017polarization], or to detect road objects in adverse weather conditions [blin2019road]. The key element of this work is the combination of the richness of polarization with deep learning-based detection. However, these applications are still characterized by the reduced size of the available training databases, which restrains them from using deep neural networks. To overcome this limitation, one possible solution is the generation of polarization-encoded images.
Generative Adversarial Networks (GAN) [goodfellow2014generative, wang2019generative]
are powerful deep generative models used to implicitly learn complex data distributions and to generate realistic samples from them. In its standard form, a GAN consists of two models: a generator, which maps samples drawn from a low-dimensional latent distribution (usually uniform or Gaussian) to high-dimensional points expected to follow the sought data distribution, and a discrimination model, which discriminates the real samples from the generated ones [goodfellow2014generative]. GANs have proven remarkable for various application domains, including image generation [arjovsky17a, isola2017image, zhu2017unpaired, hoffman18a_cycada] or image attribute manipulation [Antipov_face_aging], to name a few.
Arguably, most of the impressive achievements of GANs were obtained on color images. A body of work has attempted to extend GAN architectures to other, less common imaging domains. For instance, some existing methods rely on CycleGAN [zhu2017unpaired], an image-to-image translation network, to generate infrared road scenes from RGB counterpart images [zhang2018synthetic], to produce thermal images for person re-identification [Kniaz_2018_ECCV_Workshops] or for infrared image colorization [mehri2019colorizing]. In the same vein, data augmentation in the field of medical imaging was achieved [nie2017medical] by transforming MRI inputs into pseudo-CT images. Following this stream of work, this paper contributes to generative models for non-conventional imaging techniques. Specifically, we propose a generative model framework to produce realistic polarimetric images.
Unlike RGB, LiDAR or infrared image generation, which mostly has to satisfy qualitative visual constraints, unless some learnable knowledge constraints are enforced (see [hu2018deep] for pose-conditional person image generation), sampling polarization images is more challenging. Indeed, this imaging technique comes with physical admissibility constraints on the pixels of an image [bass1995handbook]. To be physically feasible, each pixel of such an image should satisfy constraints related to the principles of light polarization and to the calibration setup of the acquisition device. Therefore, we propose a set of constraints that, when satisfied, ensure that the generated images are physically feasible. Based on these constraints, we formulate the problem of polarimetric image generation as a domain-transfer learning problem under physical constraints that guarantee the validity of the generated images. We propose a learning framework based on CycleGAN [zhu2017unpaired], which enables unpaired image-to-image translation with relatively few images, to which we add terms that handle the physical polarization constraints during training. This allows circumventing the expensive labelling step by transferring a labelled source RGB dataset to the polarimetric domain while keeping the shapes and contents of the source images unchanged.
We demonstrate the effectiveness of our constrained-output CycleGAN on the KITTI dataset [geiger2012we] as well as on the Berkeley Deep Drive dataset (BDD100K) [xu2017end], both commonly used for object detection in road scenes. Using the generated polarization-encoded images to train a deep object detector, we observe an improvement in the detection of cars and pedestrians, which are of great interest for autonomous driving applications.
To summarize, the contributions of this paper are:
to the best of our knowledge, we propose the first framework for generating physically admissible polarization-encoded images from RGB images,
we propose an extension of CycleGAN that generates polarization-encoded images while handling the physical constraints that the pixels of the generated images must satisfy,
we show that, when plugged into the training procedure of an object detector for pre-training, the generated images help improve the detection performance.
The remainder of the paper is organized as follows: the polarization formalism and the physical constraints it involves are first presented. Then, the CycleGAN approach is described and a way to take into account these physical constraints during the training process of the CycleGAN for generating polarimetric images is investigated. Experimental evaluations are conducted; they aim to translate RGB images from KITTI and BDD100K datasets into polarimetric images. Finally, the generated images are exploited to boost the performances of an object detection network. The code for the experiments and the trained models are available at: https://anonymous.4open.science/r/4a83820e-9c65-417c-af3a-ab2979d6e2e8/
This section introduces the polarization formalism, the GAN approach and the CycleGAN principles. It focuses on the formulation of the proposed modelling framework, namely the learning of a CycleGAN with output constraints. A solution approach and the related learning principle are presented.
2.1 Physical principles of polarimetry
Polarization describes the direction of oscillation of the electrical field of a light wave. The electrical field of a wave propagating in direction $\vec{z}$ at time $t$ is characterized by the scalar components of the vector $\vec{E}$:

$\vec{E} = \begin{pmatrix} E_x(t) \\ E_y(t) \end{pmatrix} = \begin{pmatrix} E_{0x} \cos(\omega t - kz) \\ E_{0y} \cos(\omega t - kz + \phi) \end{pmatrix}$

with $E_{0x}$ and $E_{0y}$ the maximum amplitudes of each component. $\phi$ is the phase shift between the two components $E_x$ and $E_y$, $\omega$ is the angular frequency and $k = 2\pi / \lambda$ represents the wave number, directly related to the wavelength $\lambda$. Figure 1 depicts these different parameters.
The two components of the electrical field can be combined by eliminating the temporal term, which yields the polarization ellipse equation:

$\dfrac{E_x^2}{E_{0x}^2} + \dfrac{E_y^2}{E_{0y}^2} - 2 \dfrac{E_x E_y}{E_{0x} E_{0y}} \cos\phi = \sin^2\phi \qquad (2)$

According to the value of $\phi$, three states of polarization can be observed. If $\sin\phi > 0$, the polarization is said to be right-handed; otherwise, if $\sin\phi < 0$, the polarization is left-handed. If $\phi = 0$ or $\phi = \pi$, the trajectory of the wave is linear and the polarization is known as linear polarization. If $\phi = \pi/2$ or $\phi = -\pi/2$ (with $E_{0x} = E_{0y}$), the polarization of the wave is respectively right and left circular. In these three cases the wave is totally polarized; otherwise, if $\phi$ is not constant over time, the wave is unpolarized [bass1995handbook]. Usually, in an uncontrolled environment, a light wave is a combination of a totally polarized part and an unpolarized one, which is the reason why natural light is known to be partially polarized.
2.2 Mathematical representation
There are different ways to describe the polarization of light and its interaction with media. The Stokes parameters are one of them, describing the polarization state of an electromagnetic wave, especially in the general case of a partially polarized wave. The Stokes vector $S = [S_0, S_1, S_2, S_3]^T$ contains four polarization parameters, where each component brings information about the reflected light wave as follows:
$S_0$ represents the total intensity of the light, which is not a polarization property. $S_0$ is always positive, meaning that every natural object, when illuminated, reflects light,
$S_1$ defines the amount of the light that is horizontally or vertically polarized, meaning that the electrical field represented in Figure 2 moves in a horizontal or a vertical way,
$S_2$ is the amount of the polarized light oriented at $+45°$ and $-45°$, and
$S_3$ represents the amount of circular polarization.
In the case of outdoor scenes, the sunlight is unpolarized, and it is a well-established physical fact that when an unpolarized light wave is reflected, it becomes partially linearly polarized. Its polarization depends on the surface normal and the refractive index of the material it impinges on [wolff1995polarization]. The circular component $S_3$ is thus neglected and set to zero in all applications concerning uncontrolled environments [morel2006active]. In this work, we focus only on the linearly polarized light, represented by the first three components of the Stokes vector, namely $S = [S_0, S_1, S_2]^T$.
In order to access the observable optical field, the two components $E_x$ and $E_y$ are averaged over time. Averaging the polarization ellipse of Equation (2) in the case of linear polarization gives the following equation:

$\langle E_x^2 \rangle E_{0y}^2 + \langle E_y^2 \rangle E_{0x}^2 - 2 \langle E_x E_y \rangle E_{0x} E_{0y} \cos\phi = (E_{0x} E_{0y} \sin\phi)^2 \qquad (3)$

The averaged values of the electrical components are:

$\langle E_x^2 \rangle = \tfrac{1}{2} E_{0x}^2, \qquad \langle E_y^2 \rangle = \tfrac{1}{2} E_{0y}^2, \qquad \langle E_x E_y \rangle = \tfrac{1}{2} E_{0x} E_{0y} \cos\phi$

By replacing the time-averaged terms in Equation (3), the elliptical equation can be rewritten in the following form:

$(E_{0x}^2 + E_{0y}^2)^2 = (E_{0x}^2 - E_{0y}^2)^2 + (2 E_{0x} E_{0y} \cos\phi)^2 \qquad (7)$

The linear Stokes vector is finally derived from Equation (7) as

$S = \begin{pmatrix} S_0 \\ S_1 \\ S_2 \end{pmatrix} = \begin{pmatrix} E_{0x}^2 + E_{0y}^2 \\ E_{0x}^2 - E_{0y}^2 \\ 2 E_{0x} E_{0y} \cos\phi \end{pmatrix}$
One salient physical property obtained from the Stokes parameters is the degree of linear polarization (DOP) [ainouz2013adaptive], defined by:

$\rho = \dfrac{\sqrt{S_1^2 + S_2^2}}{S_0}$

The DOP $\rho$ lies between 0 and 1 (or equivalently between 0 and 100%). It refers to the rate of the polarized part in the light wave. It is equal to 1 for totally polarized light, 0 for unpolarized light, and strictly between 0 and 1 for partially polarized light.
Equation (7) describes a totally polarized wave. By construction of the Stokes vector, the total energy in this case is equal to the partial (polarized) energy, meaning that the light is entirely composed of linearly oriented polarized waves. In the case of a partially polarized wave, the wave contains an unpolarized part, and the equality becomes an inequality, meaning that $S_0^2 \geq S_1^2 + S_2^2$. From this property, the Stokes vector is said to be physically admissible if and only if the two following conditions are met:

$S_0 > 0 \quad \text{and} \quad S_0^2 \geq S_1^2 + S_2^2 \qquad (10)$

The second condition also means that the DOP never exceeds 1; otherwise the partial energy of the wave would exceed its total energy, which is a physically impossible phenomenon.
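As an illustration, the admissibility conditions (10) and the DOP can be checked per pixel with a few lines of code. This is a minimal sketch; the function names are ours, not from the paper's implementation.

```python
import math

def is_admissible(s0, s1, s2, eps=1e-9):
    """Check the physical admissibility conditions (10) on a linear
    Stokes vector: positive total intensity and DOP at most 1."""
    return s0 > 0 and s0 ** 2 + eps >= s1 ** 2 + s2 ** 2

def degree_of_linear_polarization(s0, s1, s2):
    """DOP = sqrt(S1^2 + S2^2) / S0, in [0, 1] for admissible vectors."""
    return math.sqrt(s1 ** 2 + s2 ** 2) / s0

# Totally polarized light: S0^2 == S1^2 + S2^2, hence DOP == 1.
assert is_admissible(1.0, 0.6, 0.8)
assert abs(degree_of_linear_polarization(1.0, 0.6, 0.8) - 1.0) < 1e-9

# Partially polarized light: DOP strictly between 0 and 1.
assert abs(degree_of_linear_polarization(1.0, 0.3, 0.4) - 0.5) < 1e-9

# Physically impossible vector: partial energy exceeds total energy.
assert not is_admissible(1.0, 0.9, 0.9)
```

The small tolerance `eps` accounts for floating-point rounding when the pixel lies exactly on the feasibility boundary.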
2.3 Stokes imaging
Polarimetric images are obtained by computing the Stokes vector related to each pixel. The acquisition principle is based on a device composed of a polarizer oriented at an angle $\alpha_i$ placed between the object and the sensor [Wang_2019_CVPR]. At least three acquisitions with three different angles are required to recover the Stokes parameters. The light reflected from the object, represented by the unknown Stokes vector, passes through the polarizer before reaching the camera. In our work, we use a Polarcam 4D Technology polarimetric camera, which simultaneously provides four images obtained with four linear polarizers oriented at (0°, 45°, 90°, 135°). The polarimetric camera measures an intensity $I(\alpha_i)$ of the scene for each angle $\alpha_i$. The relationship between the Stokes vector and the intensities reaching the camera is given by:

$I = A \cdot S \qquad (12)$

where $I = [I(0°), I(45°), I(90°), I(135°)]^T$ refers to the four intensities according to each angle of the polarizer and $A$ to the calibration matrix of the polarization camera, defined as:

$A = \dfrac{1}{2} \begin{pmatrix} 1 & 1 & 0 \\ 1 & 0 & 1 \\ 1 & -1 & 0 \\ 1 & 0 & -1 \end{pmatrix}$
To get the unknown Stokes parameters from the measured intensities (see Equation (12)), we require the pseudo-inverse $A^+$ of the matrix $A$. The relationship between $S$ and $I$ is then defined by:

$S = A^+ \cdot I \qquad (14)$

For the intensities to be consistent with Equation (12), we must have $I = A \cdot A^+ \cdot I$, which is satisfied if and only if:

$I(0°) + I(90°) = I(45°) + I(135°) \qquad (16)$
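To make the acquisition model concrete, here is a minimal sketch assuming the ideal calibration matrix given above, for which the pseudo-inverse $A^+ = (A^T A)^{-1} A^T$ has a simple closed form; the helper names are hypothetical.

```python
# Ideal calibration matrix A for polarizer angles (0°, 45°, 90°, 135°):
# I(alpha) = 0.5 * (S0 + S1*cos(2*alpha) + S2*sin(2*alpha)).
A = [[0.5, 0.5, 0.0],
     [0.5, 0.0, 0.5],
     [0.5, -0.5, 0.0],
     [0.5, 0.0, -0.5]]

def intensities_from_stokes(s):
    """Forward model I = A . S (Equation (12))."""
    return [sum(a_ij * s_j for a_ij, s_j in zip(row, s)) for row in A]

def stokes_from_intensities(i):
    """Closed form of S = A+ . I (Equation (14)) for the ideal A above:
    A+ = (A^T A)^-1 A^T reduces to these three linear combinations."""
    i0, i45, i90, i135 = i
    return [0.5 * (i0 + i45 + i90 + i135), i0 - i90, i45 - i135]

def calibration_residual(i):
    """Constraint (16): a feasible intensity vector satisfies
    I(0°) + I(90°) == I(45°) + I(135°); the residual should be zero."""
    i0, i45, i90, i135 = i
    return (i0 + i90) - (i45 + i135)

s = [1.0, 0.3, 0.2]
i = intensities_from_stokes(s)
assert max(abs(a - b) for a, b in zip(stokes_from_intensities(i), s)) < 1e-12
assert abs(calibration_residual(i)) < 1e-12
```

Any intensity vector produced by the forward model automatically satisfies (16); the constraint only bites for generated intensities that do not come from a valid Stokes vector.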
2.4 Unpaired image-to-image translation with CycleGAN
Given two domains $X$ and $Y$, unpaired image-to-image translation is the task of learning the mapping functions $G: X \to Y$ and $F: Y \to X$ using unpaired samples $x$ from $X$ and $y$ from $Y$. An effective approach to this task is CycleGAN [zhu2017unpaired]. It consists in learning the two mapping models $G$ and $F$ by combining the objective function of the standard Generative Adversarial Network (GAN) [goodfellow2014generative] with a cycle-consistency loss function. The adversarial cost related to the GAN trains the models to generate samples that match the target domain distribution, while the cycle-consistency cost ensures that the learned models are able to correctly reconstruct an original image (of the source domain) from a generated one.
Formally, a GAN is composed of a generative model $G$, which maps a known distribution $p_z$, usually normal or uniform, to the unknown distribution $p_{data}$ of the samples, and a discrimination model $D$. The generator $G$ attempts to fool the discriminator $D$, which in turn tries to distinguish a real sample from a sample generated by the model $G$. Learning a GAN amounts to solving the following problem:

$\min_G \max_D \; \mathbb{E}_{x \sim p_{data}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]$
CycleGAN trains the two models $G$ and $F$ by using solely unpaired real samples $x$ and $y$, respectively drawn according to the (unknown) distributions $p_X$ and $p_Y$, as input. It also learns two discrimination networks $D_X$ and $D_Y$ able to tell generated samples from real ones in the domains $X$ and $Y$ respectively. CycleGAN relies on the least-squares variant of GAN [mao2017least] and considers the following adversarial costs:

$\mathcal{L}_{adv}(G, D_Y) = \mathbb{E}_{y \sim p_Y}[(D_Y(y) - 1)^2] + \mathbb{E}_{x \sim p_X}[D_Y(G(x))^2]$
$\mathcal{L}_{adv}(F, D_X) = \mathbb{E}_{x \sim p_X}[(D_X(x) - 1)^2] + \mathbb{E}_{y \sim p_Y}[D_X(F(y))^2]$
In order to ensure the cyclic consistency, i.e. that both the compositions $F \circ G$ and $G \circ F$ behave as identity functions, a reconstruction error term is devised for the mapping models:

$\mathcal{L}_{cyc}(G, F) = \mathbb{E}_{x \sim p_X}[\| F(G(x)) - x \|_1] + \mathbb{E}_{y \sim p_Y}[\| G(F(y)) - y \|_1]$
Gathering all these elements leads to the objective function:

$\mathcal{L}(G, F, D_X, D_Y) = \mathcal{L}_{adv}(G, D_Y) + \mathcal{L}_{adv}(F, D_X) + \lambda \, \mathcal{L}_{cyc}(G, F) \qquad (23)$

where $\lambda$ is a hyper-parameter that controls the reconstruction term. Training a CycleGAN consists in solving, via alternate gradient descent, the following minmax problem:

$\min_{G, F} \max_{D_X, D_Y} \; \mathcal{L}(G, F, D_X, D_Y) \qquad (24)$
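A minimal sketch of these cost functions on toy scalar "images" may help fix intuitions; the helper names are ours, and a real implementation would operate on tensors inside a deep learning framework.

```python
def mean(xs):
    return sum(xs) / len(xs)

def lsgan_loss_d(d_real, d_fake):
    """Least-squares discriminator loss [mao2017least]: real -> 1, fake -> 0."""
    return mean([(d - 1.0) ** 2 for d in d_real]) + mean([d ** 2 for d in d_fake])

def lsgan_loss_g(d_fake):
    """Least-squares generator loss: push discriminator outputs on fakes to 1."""
    return mean([(d - 1.0) ** 2 for d in d_fake])

def cycle_loss(x_batch, x_reconstructed):
    """L1 cycle-consistency: F(G(x)) should reproduce x (and G(F(y)), y)."""
    return mean([abs(a - b) for x, xr in zip(x_batch, x_reconstructed)
                 for a, b in zip(x, xr)])

# A perfect discriminator yields zero discriminator loss...
assert lsgan_loss_d([1.0, 1.0], [0.0, 0.0]) == 0.0
# ...while a generator that fully fools it yields zero generator loss.
assert lsgan_loss_g([1.0, 1.0]) == 0.0
# Cycle loss on 1-pixel "images": mean absolute reconstruction error.
assert abs(cycle_loss([[0.5], [0.2]], [[0.4], [0.2]]) - 0.05) < 1e-12
```

The full objective (23) is then the sum of the two adversarial terms plus $\lambda$ times the cycle term, minimized over the generators and maximized over the discriminators as in (24).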
2.5 Proposed approach
As discussed above, our main goal is to learn a generative model able to produce realistic polarization-based images starting from RGB images. As our proposed solution to this problem, we adopt the image-to-image translation framework and extend it to account for the constraints a polarimetric image must fulfill.
To generate a polarimetric image from an RGB one, we propose to use the CycleGAN approach to learn the translation models $G$ and $F$ between the domain of polarimetric images and the RGB image domain. Let $I = [I(0°), I(45°), I(90°), I(135°)]^T$ be the intensity vector associated with a pixel of a generated polarimetric image. To be physically admissible, each pixel has to satisfy the admissibility (10) and the calibration (16) constraints. We refer to these polarimetric constraints as $C_1$, $C_2$ and $C_3$:

$C_1$: $S_0 > 0$ (positivity),
$C_2$: $I(0°) + I(90°) = I(45°) + I(135°)$ (calibration),
$C_3$: $S_0^2 \geq S_1^2 + S_2^2$ (admissibility).
$S_0$ is always positive, as it represents the total intensity reflected from an object. As the last layer of the generation models customarily uses the hyperbolic tangent as activation function, each output intensity is within the range $[-1, 1]$, which we scale to $[0, 1]$. Hence $S_0$ (see (14)) is ensured to be positive, so the positivity constraint $C_1$ can be deemed satisfied for the generated polarimetric images. To handle the remaining calibration ($C_2$) and admissibility ($C_3$) constraints, one could resort to the Lagrangian dual of the CycleGAN optimization problem (24) subject to these constraints. However, this may be computationally expensive, as it requires optimizing four neural networks (the discrimination and the mapping network models) in an inner loop of a dual ascent algorithm. Moreover, the overall optimization procedure may not be stable because of the minmax game involved in CycleGAN learning.
In order to derive an efficient algorithm to learn a CycleGAN under output constraints, we introduce a relaxation of the problem. Instead of strictly enforcing the constraints, we measure how far the generated image pixels are from the feasibility domain through additional cost functions that we attempt to minimize. For the calibration constraint $C_2$, we propose to use a distance between the generated intensities $I$ and their calibration-consistent reprojection $A \cdot A^+ \cdot I$:

$\mathcal{L}_{cal}(I) = \| I - A \cdot A^+ \cdot I \|_2^2$

where $A^+ \cdot I$ is the Stokes vector computed from the generated image using Equation (14). Similarly, to enforce the admissibility constraint $C_3$, a rectified linear penalty is considered, defined as:

$\mathcal{L}_{adm}(I) = \max\left(0, \; \hat{S}_1^2 + \hat{S}_2^2 - \hat{S}_0^2\right), \quad \text{with } \hat{S} = A^+ \cdot I$

The loss $\mathcal{L}_{cal}$ enforces the respect of the acquisition conditions according to the calibration matrix $A$, while $\mathcal{L}_{adm}$ pushes the generated images towards respecting the physical admissibility constraint on the Stokes vectors obtained from the generated image.
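The two relaxation penalties can be sketched per pixel as follows, assuming the ideal calibration matrix from Section 2.3. The exact loss forms used in the paper may differ (e.g. in the choice of norm), so this is only an illustration with hypothetical helper names.

```python
def stokes_from_intensities(i):
    """S = A+ . I (Equation (14)) for the ideal calibration matrix."""
    i0, i45, i90, i135 = i
    return [0.5 * (i0 + i45 + i90 + i135), i0 - i90, i45 - i135]

def calibration_penalty(i):
    """Squared distance between I and its reprojection A . (A+ . I):
    zero iff the calibration constraint (16) holds (sketch of L_cal)."""
    s = stokes_from_intensities(i)
    reproj = [0.5 * (s[0] + s[1]), 0.5 * (s[0] + s[2]),
              0.5 * (s[0] - s[1]), 0.5 * (s[0] - s[2])]
    return sum((a - b) ** 2 for a, b in zip(i, reproj))

def admissibility_penalty(i):
    """Rectified linear penalty: positive only when S1^2 + S2^2 > S0^2,
    i.e. when the DOP would exceed 1 (sketch of L_adm)."""
    s0, s1, s2 = stokes_from_intensities(i)
    return max(0.0, s1 ** 2 + s2 ** 2 - s0 ** 2)

feasible = [0.65, 0.6, 0.35, 0.4]    # satisfies I(0°)+I(90°) == I(45°)+I(135°)
assert calibration_penalty(feasible) < 1e-12
assert admissibility_penalty(feasible) == 0.0

infeasible = [0.9, 0.2, 0.1, 0.2]    # violates the calibration constraint
assert calibration_penalty(infeasible) > 0.0
```

Being plain differentiable expressions of the generator outputs, both penalties can be added to the CycleGAN loss and minimized by the same gradient-based optimizer.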
Gathering all these elements, we train our CycleGAN by optimizing the following objective function:

$\mathcal{L}_{total} = \mathcal{L}(G, F, D_X, D_Y) + \gamma_{cal} \, \mathcal{L}_{cal} + \gamma_{adm} \, \mathcal{L}_{adm} \qquad (28)$
The full training algorithm is summarized in Algorithm 1.
The non-negative hyper-parameters $\gamma_{cal}$ and $\gamma_{adm}$ control respectively the weight of the calibration and admissibility constraints relative to the CycleGAN loss (see (23)). As the values of $\mathcal{L}_{cal}$ and $\mathcal{L}_{adm}$ are computed pixel-wise, we consider their averages over the whole image in the objective function. The training principle of the proposed generative model is illustrated in Figure 5.
3 Experimental evaluation
Hereafter, the experimental setup, including the image generation procedure and its evaluation, is presented.
3.1 Polarimetric images generation using CycleGAN
To conduct the experiments, we rely on a combination of two datasets presented in [blin2020new, blin2019road], composed of polarimetric and RGB images. From these datasets, we select 2485 unpaired images from each domain (RGB and polarimetry). Example instances are shown in Figure 6. The polarimetric images have four channels, corresponding to the four intensities acquired by the camera, namely $I(0°)$, $I(45°)$, $I(90°)$ and $I(135°)$; the RGB images are standard three-channel images.
Our constrained CycleGAN is trained for 400 epochs on randomly cropped patches. The hyper-parameters $\gamma_{cal}$ and $\gamma_{adm}$ in (28) are set to the values that experimentally provide the best performance, as is the hyper-parameter $\lambda$ controlling the reconstruction cost. The learning rate is decreased linearly during the 400 training epochs.
To evaluate the effectiveness of our trained generative model, we consider KITTI and BDD100K (using only daytime images, since polarimetry fails to characterize objects during nighttime), which often serve as testbeds in applications related to road scene object detection. The constrained-output CycleGANs we train are used to transfer RGB images from KITTI and BDD100K to the polarimetric domain. The resulting datasets are denoted Polar-KITTI and Polar-BDD100K respectively. Since the CycleGAN architecture is fully convolutional, it has no requirement on the input image size. Therefore, even though the model was trained on patches, it scales straightforwardly to the full-size images from the KITTI and BDD100K datasets. We also apply random horizontal flips, as flipping an image does not alter the physical properties of a polarimetric image [blanchonPolarimetricImageAugmentation2021].
To assess whether or not fulfilling the physical constraints is paramount, we investigate a variant of Polar-KITTI and Polar-BDD100K: we learn a standard unconstrained CycleGAN based on the same unpaired RGB/polarimetric images.
3.2 Evaluation of the generated images
In order to assess the ability of the generated Polar-KITTI and Polar-BDD100K datasets to preserve the relevant features for road scene applications, we train a detection network following the setup in Figure 4. For this experiment, a RetinaNet-50 [lin2017focal] pre-trained on the MS COCO dataset [lin2014microsoft] is fine-tuned in three different settings. In the first setup, the detection model is fine-tuned on the original RGB KITTI (or BDD100K) dataset, while the second considers fine-tuning on the generated polarimetric images from KITTI (Polar-KITTI) or BDD100K (Polar-BDD100K). The third uses the unconstrained variant of the generated images from KITTI or BDD100K. After this first fine-tuning, the three resulting detection models are fine-tuned one last time on the real polarimetric dataset (described in Table 1).
Overall, the trained CycleGAN and detection networks under these settings are evaluated in qualitative and quantitative ways. The end goal is to check: (i) the ability of the generated images to help learning polarimetry-based features for object detection, and (ii) the influence of respecting the polarimetric feasibility constraints on detection performances.
We measure the visual quality of the generated images by computing the classical Fréchet Inception Distance (FID) [heusel2017]. Computing this distance requires extracting visual features from each set of images (real and generated) using a pre-trained deep neural network (usually an Inception v3 [szegedy2016] network pre-trained on ImageNet [deng2009imagenet]) and evaluating the Fréchet (or Wasserstein-2) distance between the distributions of these features, which are assumed to be Gaussian. We evaluate this distance using 509 images from each generated polarimetric dataset and from the test set described in Table 1.
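For intuition, the Fréchet distance between two Gaussians has a closed form. The sketch below assumes diagonal covariances to avoid a matrix square root; the full FID computation uses complete covariance matrices, so this is a simplification for illustration only.

```python
import math

def frechet_distance_diag(mu1, var1, mu2, var2):
    """Frechet distance between two Gaussians with diagonal covariances:
    ||mu1 - mu2||^2 + sum(var1 + var2 - 2*sqrt(var1*var2)).
    A simplification of the full FID, which uses full covariance matrices."""
    d_mean = sum((a - b) ** 2 for a, b in zip(mu1, mu2))
    d_cov = sum(v1 + v2 - 2.0 * math.sqrt(v1 * v2)
                for v1, v2 in zip(var1, var2))
    return d_mean + d_cov

# Identical feature distributions give a distance of zero.
assert frechet_distance_diag([0.0, 1.0], [1.0, 2.0], [0.0, 1.0], [1.0, 2.0]) == 0.0
# The distance grows with both mean and variance mismatch.
assert frechet_distance_diag([0.0], [1.0], [3.0], [4.0]) == 10.0
```

Lower values mean the generated feature distribution is closer to the real one, which is how we compare the constrained and unconstrained models below.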
As for the feature extractor, since the classical Inception v3 network is not adapted to polarimetric images, we use the convolutional part of a polarimetry-adapted RetinaNet detection network [blin2019road], which was trained on the MS COCO dataset and fine-tuned on the same real polarimetric dataset used in our CycleGAN experiments. In order to evaluate the improvements in detection, we compute the error rate evolution $\Delta_{err}$. The improvement on the detection of an object class $o$ is given by:

$\Delta_{err}(o) = \dfrac{(1 - AP_{polar}(o)) - (1 - AP_{RGB}(o))}{1 - AP_{RGB}(o)} \qquad (29)$

where $AP_{RGB}$ and $AP_{polar}$ respectively denote the average precision for object detection in RGB and in polarimetric images. A negative $\Delta_{err}(o)$ means that the detection of class $o$ was improved.
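A small helper, applied to hypothetical AP values (not results from the paper), illustrates Equation (29):

```python
def error_rate_evolution(ap_rgb, ap_polar):
    """Delta_err from Equation (29): relative change of the error rate
    (1 - AP) when moving from RGB to polarimetric pre-training.
    Negative values mean the detection improved."""
    return ((1.0 - ap_polar) - (1.0 - ap_rgb)) / (1.0 - ap_rgb)

# Hypothetical example: raising AP from 0.75 to 0.76 shrinks the error
# rate from 0.25 to 0.24, i.e. a 4% relative improvement.
delta = error_rate_evolution(0.75, 0.76)
assert abs(delta - (-0.04)) < 1e-12
```

Measuring the relative change of the error rate, rather than of the AP itself, rewards improvements on classes whose detection is already strong, where each remaining error is harder to remove.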
3.3 Results and discussion
First the qualitative coherence of the generated images is evaluated. The polarimetric images generated from their RGB equivalent are shown in Figure 7.
[Table 2 (fragment): per-class average precision (AP) and error rate evolution; e.g. person AP of 0.663 for KITTI RGB and 0.736 for BDD100K RGB; car AP of 0.794 with constraints on KITTI ($\Delta_{err}$ = -0.04) and 0.815 with constraints on BDD100K (0.03).]
[Table 3 (fragment): constraint errors for the model with constraints, including values of 1.55, 3.36% and 0.14%.]
As for the constraints, Table 3 shows how including them in the CycleGAN loss helps generate images that better fulfill the physical polarimetric properties at the pixel scale. The errors related to the calibration and admissibility constraints on generated images are consistent with the measurement errors observed on real images, whereas the unconstrained approach yields poor results. The positivity constraint is met for all generated images thanks to the use of the hyperbolic tangent as activation function in the last layer of the generative models. Additionally, the obtained Fréchet Inception Distances are 6022.7 for the unconstrained CycleGAN and 4485.1 for our approach. Note that the scale of the FID scores computed with the pre-trained RetinaNet is larger than when using a pre-trained Inception v3 network; these scores are thus to be interpreted as relative metrics. This indicates that taking the constraints into account improves both the visual and the physical quality of the generated samples.
Next, as an example, we show the benefits of the generated images through an object detection task. This enables checking whether the objects contained in the generated scenes are physically coherent. A RetinaNet-based detection model is learned according to the setups described in Section 3.2, and the obtained detection performances in terms of mean average precision (mAP) are summarized in Table 2. We chose not to evaluate the bike and motorbike detection performances, as the polarimetric dataset does not contain enough objects of those two classes.
As we can see in Table 2, using the generated polarimetric images improves the detection performance on real polarimetric images. The improvement is substantial for car and pedestrian detection. Using Polar-KITTI with constraints, we achieve a 4% improvement of the error rate (see Equation (29)) for car detection and 12% for pedestrian detection, which leads to a global improvement of 9% in the detection. Similarly, for the Polar-BDD100K dataset, we notice an improvement of 10% for pedestrian detection, which leads to an increased mAP of 5% (pedestrians and cars). However, we notice that, for BDD100K, similar detection performances are obtained for either RGB or polarimetric images. This is due to the fact that images generated using CycleGAN tend to lack precision on small objects. In order to assess the impact of this effect, we compare the evolution of the detection scores when varying the minimal area of the bounding boxes above which objects are taken into account for the detection task. The results obtained for the Polar-BDD100K and RGB BDD100K datasets are shown in Figure 8 and illustrate that, when the minimal area of bounding boxes increases, the car AP for the training including Polar-BDD100K overcomes the one including RGB BDD100K. Even if the results are limited by the lower quality of small objects in the generated images in this specific case, we can conclude that the generated polarimetric images help improve the overall detection results.
4 Conclusion and future work
In this work, we propose an efficient way to generate realistic polarimetric images subject to physical admissibility constraints. We design two loss terms that help enforce these constraints and adapt the CycleGAN algorithm to generate physically admissible images. To train the proposed output-constrained CycleGAN, we combine the standard CycleGAN objective function with the two designed cost functions in order to handle the feasibility constraints related to each polarization-encoded pixel in the image. With the proposed generative model, we successfully translate RGB road scene images into polarimetric images and show an enhancement of the detection performance. Future work will consist in improving the quality of small objects in the generated images. Another extension could be the generation of polarimetric-like images in other domains, such as medical and Synthetic-Aperture Radar (SAR) imaging. Additionally, to strictly ensure the physical feasibility constraints, one could directly address the optimization task involved by the genuine constrained CycleGAN problem, instead of its proposed relaxation, using methods such as proximal gradient descent. Finally, since the constraints we develop are not limited to generative methods, they might be applicable to other augmentation techniques, such as rotation or noise addition.
This work is supported by the ICUB project (2017 ANR program: ANR-17-CE22-0011). We also thank our colleagues at CRIANN, who provided the computation resources necessary for our experiments.