Recording a sufficient number of images to train and evaluate computer vision algorithms is usually a time-consuming and expensive challenge. This is aggravated when the acquisition of images under various lighting and weather conditions needs to be considered as well. Beyond these data collection challenges, the performance of many machine learning algorithms suffers from changing illumination or environmental conditions, e.g. SLAM [glover2010fab], place recognition [olid2018single], localization and classification [maddern2014illumination], semantic segmentation [alshammari2018impact], 3D human pose estimation [robertini2018illumination] and facial expression recognition [ruiz2018deep]. Since it is impracticable to wait for different weather conditions, times of day and seasons to record images under as many variations as possible, it would be beneficial to train machine learning models to become invariant with respect to illumination and the exterior environment. Particularly for safety-critical applications, as are common in the automotive industry, it would be of interest to reduce the number of different illumination conditions necessary to guarantee reliable inference of machine learning models. Improvements on the aforementioned invariances would reduce the mileage and the number of images that need to be recorded, and hence reduce the financial risk and time investment while improving the overall robustness of the deployed system.
We aim to transform the input image by removing illumination and environmental features instead of computing more robust and invariant feature descriptors like SIFT [lowe1999object] or enforcing illumination invariance in deep neural networks through data augmentation. We achieve this by exploiting the availability of sceneries under different illumination and/or environmental conditions. We introduce a partially impossible reconstruction loss in Section 3.1 which enforces similarity in the latent space of encoder-decoder models implicitly, as opposed to an explicit constraint [antelmi2019sparse, zhang2015learning]. In contrast to shadow removal [wang2018stacked, qu2017deshadownet] or relighting [sun2019single, zhang2020portrait], our method removes all the illumination and environmental features together. Our method is neither limited to a specific application where prior knowledge, e.g. about faces [zhang2020portrait, shu2017portrait], needs to be included, nor does it need shadow and shadow-free image pairs [wang2018stacked, qu2017deshadownet] to define a ground truth target. We highlight its applicability on multiple datasets and provide evidence for the usefulness of collecting images under these more challenging conditions. Example results on multiple datasets are shown in Fig. 1.
In this work, we focus on the automotive application of occupant classification in the vehicle interior rear bench to demonstrate our proposed method’s applicability. To this end, we release a synthetic dataset for occupant classification in three vehicle interiors where each scenery is rendered under ten different illumination and environmental conditions. We will demonstrate the benefits of combining an encoder-decoder based approach for illumination and environmental feature removal together with a triplet loss regularizer in the latent space. The latter improves the nearest neighbour search on test samples and hence the reliability and generalization to unseen samples. We quantitatively assess this improvement based on the classification accuracy. Our key contributions can be summarized as follows:
We introduce a partially impossible reconstruction cost function in encoder-decoder models to remove illumination and environmental features,
We highlight the importance of a triplet loss regularizer in the latent space of encoder-decoder models to improve generalization to unseen sceneries,
We release the SVIRO-Illumination dataset, which contains 1500 sceneries (once with people only and once with child and infant seats) from three vehicle interiors, where each scene is rendered under 10 different illumination and environmental conditions.
2 Related Work
Datasets: Recording identical, or similar, sceneries under different lighting or environmental conditions is a challenging task. Large-scale datasets for identical sceneries under different lighting conditions are currently scarce. The Deep Portrait Relighting Dataset [zhou2019deep] is based on the CelebA-HQ [karras2018progressive] dataset and contains human faces under different illumination conditions. However, the re-illumination has been added synthetically. To avoid this limitation, we instead used the Extended Yale Face Database B [GeBeKr01], which is a dataset of real human faces with real illumination changes. While cross-season correspondence datasets prepared according to [larsson2019cross] and based on RobotCar [RobotCarDatasetIJRR] and CMU Visual Localization [badino2011visual] could potentially be used for our investigation, the correspondences are usually not exact enough to have an identical scene under different conditions. Moreover, prominently visible, changing vehicles on the streets induce a large difference between the images. Alternative datasets such as St. Lucia Multiple Times of Day [Glover2010ICRA] and Nordland [olid2018single] suffer from similar problems. These datasets stem from the image correspondence search, place recognition and SLAM communities. Instead, we adopt the Webcam Clip Art dataset [lalonde-siggraph-asia-09] to include a dataset for the exterior environment with changing seasons and times of day. It contains webcam images of outdoor regions from different places all over the world.
Consistency in latent space: Existing encoder-decoder based methods try to represent the information from multiple domains [antelmi2019sparse] or real-synthetic image pairs [zhang2015learning] identically in the latent space by enforcing similarity constraints, i.e. the latent vectors should be close together. However, these approaches often force networks to reconstruct some (or all) of the images correctly in the decoder part. Forcing an encoder-decoder to represent two images (same scenery, but different lighting) identically in the latent space, yet simultaneously forcing it to reconstruct both input images correctly, implies an impossibility: the decoder cannot reconstruct two different images from the same latent representation. Antelmi et al. [antelmi2019sparse] adopted a different encoder-decoder for each domain, but as illumination changes are continuous and not discrete, we cannot have a separate encoder or decoder for each possible illumination.
Shadow removal and relighting: Recent advances in portrait shadow manipulation [zhang2020portrait] try to remove shadows cast by external objects and to soften shadows cast by the facial features of the subjects. While the proposed method can generalize to images taken in the wild, it has problems with detailed shadows and it assumes that shadows are cast either by foreign objects or by facial features. Most importantly, it assumes facial images as input and exploits the detection of facial landmarks and their symmetries to remove the shadows. Other shadow removal methods [wang2018stacked, qu2017deshadownet] are limited to simpler images: the backgrounds and illumination are usually quite uniform and the images contain a single connected shadow. Moreover, the availability of shadow and shadow-free image pairs provides a well-defined ground truth. However, this is not possible for more complex scenes and illumination conditions, for which a ground truth is not available or even impossible to define. Image relighting [sun2019single, zhou2019deep] could potentially be used to change the illumination of an image to some uniform illumination. However, as noted in [sun2019single, zhang2020portrait], relighting struggles with foreign or harsh shadows. While it is possible to fit a face to a reference image [shu2017portrait], this option is limited to facial images as well.
We introduce our proposed partially impossible cost function for encoder-decoder networks to exploit the availability of identical sceneries under different lighting conditions. We then extend our method by applying a triplet loss regularizer in the latent space to improve generalization. This induces some useful properties such that more robust and reliable results on unseen test samples can be achieved by adopting a nearest neighbour search.
3.1 Partially impossible reconstruction loss
Our proposed partially impossible reconstruction cost function can be applied to any encoder-decoder neural network architecture. Instead of considering the standard autoencoder reconstruction loss defined as the difference between the input image and the decoder reconstruction, we formulate an alternative reconstruction loss based on the decoder reconstruction and a new variation of the input image.
Let $\mathcal{X}$ be the set of all training images and $X_j$ be the $j$th scene of the training data. For each scene we have $n$ images, where each image represents the same scene under different lighting and/or environmental conditions. We denote by $X_j^i$ the $i$th image out of the $n$ images for scene $X_j$. Hence, the training data can be expressed as $\mathcal{X} = \{X_j^i\}$ for $i \in \{1, \dots, n\}$ and $j \in \{1, \dots, m\}$, where $m$ is the total number of unique scenes. Moreover, $X_j^a \neq X_j^b$ for $a \neq b$. Denote by $\mathcal{X}_B \subset \mathcal{X}$ a subset containing $B$ sceneries from all the sceneries available in the training data. During training, the batches iterate over the $X_j \in \mathcal{X}_B$ and for each $X_j$ we randomly select $a \neq b$ to get $X_j^a \neq X_j^b$. Finally, $X_j^a$ is considered input to the encoder-decoder network and $X_j^b$ is considered as the target for the reconstruction loss. The aforementioned method is illustrated in Fig. 2. The reconstruction loss can hence be formulated as
$$\mathcal{L}_R(\mathcal{X}_B) = \sum_{X_j \in \mathcal{X}_B} r\big(d(e(X_j^a)), X_j^b\big), \tag{1}$$
where $e$ is the encoder and $d$ the decoder. The reconstruction loss is computed between the reconstruction $d(e(X_j^a))$ of the input image $X_j^a$ and an image $X_j^b$ of the same scene under different environmental conditions. In this work, we consider for $r(\cdot, \cdot)$ the structural similarity index (SSIM) [bergmann2018improving]:
$$r(\hat{X}, X) = 1 - \mathrm{SSIM}(\hat{X}, X),$$
but alternative image comparison functions can be considered as well.
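The sampling scheme described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: `sample_pairs`, `mse` and `reconstruction_loss` are hypothetical names, images are flat lists of pixel values, the model is any callable, and a mean squared error stands in for the SSIM-based comparison used in this work.

```python
import random


def sample_pairs(scenes, rng):
    """For each scene (a list of n >= 2 images), pick an input and a different target."""
    pairs = []
    for variants in scenes:
        a, b = rng.sample(range(len(variants)), 2)  # two distinct indices, so a != b
        pairs.append((variants[a], variants[b]))
    return pairs


def mse(x, y):
    """Stand-in image comparison (assumption); the paper uses an SSIM-based term."""
    return sum((p - q) ** 2 for p, q in zip(x, y)) / len(x)


def reconstruction_loss(model, pairs, compare=mse):
    """Sum of compare(model(input), target) over the batch; the target is never the input."""
    return sum(compare(model(x_in), x_target) for x_in, x_target in pairs)


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy data: 3 scenes, each given under 4 brightness levels.
    base = [[0.2] * 16, [0.5] * 16, [0.8] * 16]
    scenes = [[[p * s for p in img] for s in (0.5, 0.75, 1.0, 1.25)] for img in base]
    pairs = sample_pairs(scenes, rng)
    identity = lambda x: x  # hypothetical "model" that simply copies its input
    print(reconstruction_loss(identity, pairs))
```

Because the target always differs from the input, a model that merely copies its input (as `identity` does here) cannot reach zero loss, which is the sense in which the task is partially impossible.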
Our cost function formulation implies a partially impossible task to solve. The input image $X_j^a$ does not convey enough information to perfectly reconstruct the same scene under different environmental conditions, i.e. the target image $X_j^b$, in its entirety. While $X_j^a$ usually contains all the information about the objects in the scene, it does not contain any information about the illumination or environmental condition of $X_j^b$. However, both images are similar enough such that the encoder-decoder model can learn to focus on what is important: the salient features (e.g. people). Consequently, the only possibility for the neural network to minimize the loss is to focus on the objects in the scene which remain constant and to neglect all the lighting and environment information, because the input images do not include information on how to handle it correctly. This implies that the neural network implicitly learns to focus the reconstruction on the people, objects and vehicle interior and to average out all the other information which changes between the similar scenes, i.e. the illumination and environment. This can be observed in Fig. 5, where we compare the reconstructions of similar sceneries after training: all background information and lighting conditions have either been removed or replaced by constant values. The encoder learns to remove the illumination information. The decoder is light invariant and cannot produce different illuminations, since the information has already been removed in the latent space representation.
Our proposed method is not limited to having the same scenery under different illumination conditions. One could use different augmentation transformations on the same input image to form $X_j^a$ and $X_j^b$ and hence create the images on the fly. Alternatively, one could apply a reverse denoising approach where only $X_j^a$ is augmented and $X_j^b$ is the clean input image. See Fig. S1 in the supplementary material for an example of both approaches.
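Both variants can be illustrated with a toy augmentation. The sketch below is only an assumption-laden example: brightness scaling is a hypothetical choice of augmentation, images are flat lists of values in [0, 1], and the function names are invented for illustration.

```python
import random


def adjust_brightness(image, factor):
    """Scale pixel intensities by `factor` and clip the result to [0, 1]."""
    return [min(1.0, max(0.0, p * factor)) for p in image]


def augmented_pair(image, rng, low=0.5, high=1.5):
    """Two independently augmented views of one image, used as (input, target)."""
    x_in = adjust_brightness(image, rng.uniform(low, high))
    x_target = adjust_brightness(image, rng.uniform(low, high))
    return x_in, x_target


def reverse_denoising_pair(image, rng, low=0.5, high=1.5):
    """Only the input is augmented; the clean image serves as the target."""
    return adjust_brightness(image, rng.uniform(low, high)), list(image)


if __name__ == "__main__":
    rng = random.Random(0)
    clean = [0.1, 0.4, 0.7, 0.9]
    print(augmented_pair(clean, rng))
    print(reverse_denoising_pair(clean, rng))
```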
3.2 Triplet loss and nearest neighbour search
While the aforementioned method works well on the training data, generalizing to unseen test images remains a challenging task if no additional precautions are taken. The illumination is still removed from test samples, but the reconstruction of the objects of interest can be less stable. As training data is limited, the encoder-decoder network is mostly used as a compression method instead of a generative model. Consequently, generalizing to unseen variations cannot trivially be achieved. Examples of failures are plotted in Fig. 6 and Fig. 10: it can be observed that applying the model to test images can cause blurry reconstructions. It turns out that the blurry reconstruction is in fact a blurry version of the reconstruction of its nearest neighbour in the latent space (or a combination of several nearest neighbours). A comparison of the five nearest neighbours for several encoder-decoder models is shown in Fig. 9.
Consequently, instead of reconstructing the encoded test sample, it is more beneficial to reconstruct its nearest neighbour. However, applying nearest neighbour search in the latent space of a vanilla autoencoder (AE) or variational autoencoder (VAE) will not provide robust results. This is due to the fact that there is no guarantee that the learned latent space representation follows an $L_2$ metric [arvanitidis2018latent]. As the nearest neighbour search is (usually) based on the $L_2$ norm, it will hence not always work reliably.
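To this end, we incorporated a triplet loss [hoffer2015deep] in the latent space of the encoder-decoder model (TAE) instead. Using the same notations, the triplet loss can be defined as
$$\mathcal{L}_T = \sum_{(j, k, l)} \max\big(0, \lVert e(X_j^a) - e(X_k^b) \rVert_2 - \lVert e(X_j^a) - e(X_l^c) \rVert_2 + \alpha\big),$$
where $X_j^a$ is the anchor using scenery $X_j$, $X_k^b$ is the positive sample using a different scenery $X_k$, $X_l^c$ is the negative sample using another scenery $X_l$, and $\alpha$ is the margin. An illustration of the nearest neighbour inference is given in Fig. 3 and for the triplet loss in Fig. S2. The triplet loss acts as a regularizer and, due to its definition, it will also induce an $L_2$ norm in the latent space [min2009deep, cosmo2020limp, arvanitidis2018latent]. This effect is highlighted in Fig. 9, where we compare the nearest neighbours of the AE, VAE and TAE. To take full advantage of the triplet selection, we also modified the reconstruction loss (1) such that it is computed for each of the triplet samples:
$$\mathcal{L}_R = \sum_{(j, k, l)} \Big[ r\big(d(e(X_j^a)), X_j^{a'}\big) + r\big(d(e(X_k^b)), X_k^{b'}\big) + r\big(d(e(X_l^c)), X_l^{c'}\big) \Big],$$
where we take for each input image a different random output image $X_j^{a'}$, $X_k^{b'}$ and $X_l^{c'}$ of the same scenery, i.e. $a' \neq a$, $b' \neq b$ and $c' \neq c$. Consequently, the total loss is defined as
$$\mathcal{L} = \mathcal{L}_R + \mathcal{L}_T.$$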
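To make the interplay of the two terms concrete, the following sketch evaluates a margin-based triplet loss on latent vectors and adds it to the three reconstruction terms. It is an illustration under assumptions: the margin value of 0.2, the unweighted sum and the function names are not taken from the paper, and latent vectors are plain Python lists.

```python
import math


def l2_distance(u, v):
    """Euclidean (L2) distance between two latent vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """max(0, d(anchor, positive) - d(anchor, negative) + margin); margin is an assumption."""
    return max(0.0, l2_distance(anchor, positive) - l2_distance(anchor, negative) + margin)


def total_loss(reconstruction_terms, triplet_term):
    """Unweighted sum of the per-sample reconstruction terms and the triplet term."""
    return sum(reconstruction_terms) + triplet_term


if __name__ == "__main__":
    z_anchor, z_positive, z_negative = [0.0, 0.0], [0.1, 0.0], [3.0, 4.0]
    l_t = triplet_loss(z_anchor, z_positive, z_negative)
    # Hypothetical reconstruction terms for anchor, positive and negative.
    print(total_loss([0.30, 0.25, 0.40], l_t))
```

The triplet term vanishes once the negative is farther from the anchor than the positive by at least the margin, so it only shapes the latent space where classes are not yet separated.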
We will present an analysis of the aforementioned properties, problems and improvements on the SVIRO-Illumination dataset to highlight the benefit of our design choices. We will present results on two additional publicly available datasets to show the applicability of our proposed cost function to other problem formulations as well.
4.1 Training details
We center-cropped the images to the smallest image dimension and then resized them to 224×224 pixels. We used a batch size of 16, trained our models for 1000 epochs and did not perform any data augmentation. We used the Adam optimizer with a learning rate of 0.0001. Image similarity between target image and reconstruction was computed using SSIM [bergmann2018improving]. We used a latent space of dimension 16. The model architecture is detailed in Table S1 in the supplementary material: it uses the VGG-11 architecture [simonyan2014very] for the encoder part and reverses the layers together with nearest neighbour up-sampling for the decoder part. However, our proposed cost function is not tied to this architecture choice. We used PyTorch 1.6, torchvision 0.7 and pytorch-msssim 2.0 [Gongfan2019] for all our experiments.
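The preprocessing and hyperparameters listed above can be summarized as follows. This is a sketch for orientation only: `center_crop_box` and `TRAIN_CONFIG` are hypothetical names, and the helper only computes the crop rectangle rather than touching pixel data.

```python
def center_crop_box(width, height):
    """Return (left, top, right, bottom) of the largest centred square in the image."""
    side = min(width, height)
    left = (width - side) // 2
    top = (height - side) // 2
    return left, top, left + side, top + side


# Values taken from the training details in the text; the key names are assumptions.
TRAIN_CONFIG = {
    "image_size": 224,
    "batch_size": 16,
    "epochs": 1000,
    "optimizer": "adam",
    "learning_rate": 1e-4,
    "latent_dim": 16,
    "augmentation": None,
}

if __name__ == "__main__":
    # A landscape image is cropped horizontally, a portrait image vertically.
    print(center_crop_box(960, 640))
    print(center_crop_box(640, 960))
```

With Pillow, for instance, the returned box could be passed to `Image.crop` and followed by `Image.resize((224, 224))`; which library the authors used for this step is not stated.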
Based on the recently released SVIRO dataset [DiasDaCruz2020SVIRO], we created additional images for three new vehicle interiors. For each vehicle, we randomly generated 250 training and 250 test scenes where each scenery was rendered under 10 different illumination and environmental conditions. We created two versions: one containing only people and a second one additionally including occupied child and infant seats. We used 10 different exterior environments (HDR images rotated randomly around the vehicles), 14 (or 8) human models, 6 (or 4) children and 3 babies respectively for the training and test split. The four infant and two child seats have the same geometry for each split, but they use different textures. Consequently, the models need to generalize to new illumination conditions, humans and textures. There are four possible classes for each of the three seat positions (empty, infant seat, child seat and adult), leading to a total of $4^3 = 64$ classes for the whole image. Examples are shown in Fig. 4 and Fig. S3-S5 in the supplementary material.
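The per-image class count follows from treating the rear bench as three seat positions with four classes each. The sketch below shows one possible way to map a seat configuration to a single class index and back; the base-4 encoding and all names are assumptions for illustration, not the labelling scheme of the dataset.

```python
from itertools import product

SEAT_CLASSES = ("empty", "infant seat", "child seat", "adult")
NUM_SEATS = 3  # assumption: three seat positions on the rear bench


def encode_occupancy(seats):
    """Map a tuple of per-seat class names to one integer (base-4 positional code)."""
    index = 0
    for label in seats:
        index = index * len(SEAT_CLASSES) + SEAT_CLASSES.index(label)
    return index


def decode_occupancy(index):
    """Inverse of encode_occupancy: recover the per-seat class names."""
    seats = []
    for _ in range(NUM_SEATS):
        index, class_id = divmod(index, len(SEAT_CLASSES))
        seats.append(SEAT_CLASSES[class_id])
    return tuple(reversed(seats))


if __name__ == "__main__":
    all_configs = list(product(SEAT_CLASSES, repeat=NUM_SEATS))
    print(len(all_configs))  # number of whole-image classes
    print(encode_occupancy(("empty", "infant seat", "adult")))
```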
4.2.1 Reconstruction results
For the triplet loss sampling, we chose the positive sample to be of the same class as the anchor image (but from a different scenery) and the negative sample to differ on only one seat (i.e. the class changes on a single seat w.r.t. the anchor image). Images of three empty seats do not contain any information which could mislead the network, so to make the task more challenging, we did not use them as negative samples.
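The selection rule above can be written down compactly. The following sketch is a hypothetical implementation: per-seat classes are encoded as integers with 0 meaning an empty seat, `labels` maps a scenery identifier to its seat configuration, and sceneries without a valid positive or negative simply yield `None`.

```python
import random

EMPTY = 0  # hypothetical integer code for an empty seat


def differs_on_one_seat(a, b):
    """True if two seat configurations differ on exactly one seat position."""
    return sum(x != y for x, y in zip(a, b)) == 1


def sample_triplet(anchor_id, labels, rng):
    """Pick (anchor, positive, negative) scenery ids following the rule in the text."""
    anchor = labels[anchor_id]
    # Positive: same class configuration, but a different scenery.
    positives = [s for s, lab in labels.items() if lab == anchor and s != anchor_id]
    # Negative: differs on exactly one seat; all-empty images are never used.
    negatives = [
        s for s, lab in labels.items()
        if differs_on_one_seat(anchor, lab) and any(c != EMPTY for c in lab)
    ]
    if not positives or not negatives:
        return None
    return anchor_id, rng.choice(positives), rng.choice(negatives)


if __name__ == "__main__":
    labels = {0: (3, 0, 0), 1: (3, 0, 0), 2: (0, 0, 0), 3: (3, 3, 0), 4: (1, 2, 3)}
    print(sample_triplet(0, labels, random.Random(0)))
```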
After training, the encoder-decoder model learned to remove all the illumination and environmental information from the training images. See Fig. 5 for an example of how images from the same scenery, but under different illumination, are transformed. Sometimes, test samples are not reconstructed reliably. However, due to the triplet constraint and nearest neighbour search, we can preserve the correct classes and reconstruct a clean image: see Fig. 6 for an example. The reconstruction of the test image latent vector produces a blurry person, which is usually a combination of several nearest neighbours. The reliability of the class preservation is investigated in Section 4.2.3 based on the classification accuracy. We want to emphasize that the model is not learning to focus the reconstruction on a single training image for each scenery. In Fig. 7 we searched for the closest and furthest (w.r.t. SSIM) input images of the selected scenery w.r.t. the reconstruction of the first input image. Moreover, we selected the reconstruction of all input images which is furthest away from the first one to get an idea of the variability of the reconstructions inside a single scenery. While the reconstructions are stable for all images of a scenery, it can be observed that the reconstructions are far from all training images. Hence, the model did not learn to focus the reconstruction on a single training sample, but instead learned to remove all the unimportant information from the input image. The shape and features of the salient objects are preserved as long as their position remains constant in each image; see Fig. 11 for vehicles being removed if not contained in each image. The texture of the salient objects is uniformly lit and smoothed out.
4.2.2 AE vs. VAE vs. TAE
For visualization purposes, we trained a vanilla autoencoder (AE), variational autoencoder (VAE) and triplet autoencoder (TAE) on the SVIRO-Illumination dataset with people and empty seats only. For simplicity of visualization, we chose a latent space dimension of 2 for the model definition. After training, we computed the latent space representation for all training samples and plotted the resulting distributions in Fig. 8. The triplet based encoder-decoder model separates and clusters the classes best. Some small clusters are due to under-represented classes, for which the model clusters images from the same scenery under different illuminations together. The AE uses a large range of possible values in the latent space and both the AE and VAE contain wrong classes inside other clusters. The test distribution is plotted in Fig. S6 in the supplementary material and highlights the additional benefit of the TAE for potential outlier detection. Moreover, we show in Fig. S7 and Fig. S8 that a 2-dimensional principal component analysis and t-SNE projection of a 16-dimensional latent space provides even further benefits when a TAE is used. The same models were trained with a latent space dimension of 16 including occupied child and infant seats. The classification results obtained by nearest neighbour search are compared against several other models in Section 4.2.3. The TAE outperforms the other encoder-decoder models w.r.t. accuracy.
We needed to adjust the weight of the KL divergence term (regularizer w.r.t. the Gaussian prior) in the loss to train the VAE and prevent mode collapse. This is due to the background of the vehicle interior, which is dominant in all training samples and remains similar.
It is important to note that the comparison between AE, VAE and TAE is not entirely fair, because the latter implicitly uses labels during the positive and negative sample selection. Nevertheless, for the problem formulations at hand, it is beneficial to collect the classification labels considering the additional advantage of the induced $L_2$ norm in the latent space and improved classification accuracy.
4.2.3 Classification results
We further compared the classification accuracy of our proposed method together with the nearest neighbour search against vanilla classification models when the same training data is used. This way, we can quantitatively estimate the reliability of our proposed method against commonly used models. To this end, we trained baseline classification models (ResNet-50 [he2016deep], VGG-11 [simonyan2014very] and MobileNet V2 [sandler2018mobilenetv2]) as pre-defined in torchvision on SVIRO-Illumination. For each epoch, we randomly selected one image $X_j^a$ for each scenery $X_j$. The classification models were either trained for 1000 epochs or we performed early stopping with an 80:20 split on the training data. We further fine-tuned pre-trained models for 1000 epochs. The triplet based autoencoder model is trained exactly as before. During inference, we take the label of the nearest training sample as the classification prediction. The random seeds of all libraries were fixed for all experiments and cuDNN was used in deterministic mode. Each setup was repeated 5 times with 5 different seeds (the same ones across all setups). Moreover, the experiments are repeated for all three vehicle interiors. The mean classification accuracy over all 5 runs together with the variance is reported in Table 1. Our proposed method significantly outperforms vanilla classification models trained from scratch and its performance exhibits a much smaller variance. Moreover, our proposed method outperforms fine-tuned pre-trained classification models, despite the advantage of the pre-training of these models. Additionally, we trained the encoder-decoder models using the vanilla reconstruction error between input and reconstruction, but using the nearest neighbour search as a prediction. Again, including our proposed reconstruction loss improves the models' performance significantly.
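Table 1: Mean classification accuracy ± variance over 5 runs for each of the three vehicle interiors.

| Model | Vehicle interior 1 | Vehicle interior 2 | Vehicle interior 3 |
|---|---|---|---|
| MobileNet-ES | 62.9 ± 3.1 | 71.8 ± 4.3 | 73.0 ± 0.8 |
| VGG11-ES | 64.4 ± 35 | 74.0 ± 19 | 75.5 ± 5.7 |
| ResNet50-ES | 72.3 ± 3.7 | 77.9 ± 35 | 76.6 ± 9.9 |
| MobileNet-NS | 72.7 ± 3.8 | 77.0 ± 4.1 | 77.4 ± 2.2 |
| VGG11-NS | 74.1 ± 5.8 | 71.2 ± 14 | 78.4 ± 2.6 |
| ResNet50-NS | 76.2 ± 18 | 83.1 ± 1.1 | 82.0 ± 3.2 |
| MobileNet-F | 85.8 ± 2.0 | 90.6 ± 1.2 | 88.6 ± 0.6 |
| VGG11-F | 90.5 ± 2.0 | 90.3 ± 1.2 | 89.2 ± 0.9 |
| ResNet50-F | 87.9 ± 2.0 | 89.7 ± 6.1 | 88.5 ± 1.0 |
| AE-V | 74.1 ± 0.7 | 80.1 ± 1.8 | 73.3 ± 0.9 |
| VAE-V | 73.4 ± 1.3 | 79.5 ± 0.6 | 73.0 ± 0.9 |
| TAE-V | 90.8 ± 0.3 | 91.7 ± 0.2 | 89.9 ± 0.6 |
| AE (ours) | 86.8 ± 0.3 | 86.7 ± 1.5 | 86.7 ± 0.9 |
| VAE (ours) | 81.4 ± 0.5 | 86.6 ± 0.9 | 85.9 ± 0.8 |
| TAE (ours) | 92.4 ± 1.5 | 93.5 ± 0.9 | 93.0 ± 0.3 |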
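The nearest neighbour prediction and the aggregation over seeds used for these results can be sketched as follows. This is a simplified illustration with assumed names and toy two-dimensional latent vectors; whether the reported variance is the population or the sample variance is not stated, so the use of `statistics.pvariance` is an assumption.

```python
import math
import statistics


def nearest_label(z, train_latents, train_labels):
    """Label of the training latent vector closest to z in the L2 sense."""
    distances = [math.dist(z, t) for t in train_latents]
    return train_labels[distances.index(min(distances))]


def accuracy(test_latents, test_labels, train_latents, train_labels):
    """Percentage of test samples whose nearest training neighbour has the right label."""
    hits = sum(
        nearest_label(z, train_latents, train_labels) == y
        for z, y in zip(test_latents, test_labels)
    )
    return 100.0 * hits / len(test_labels)


def summarize(run_accuracies):
    """Mean and (population) variance of the accuracies obtained with different seeds."""
    return statistics.mean(run_accuracies), statistics.pvariance(run_accuracies)


if __name__ == "__main__":
    train_latents = [(0.0, 0.0), (10.0, 10.0)]
    train_labels = ["empty", "adult"]
    test_latents = [(1.0, 1.0), (9.0, 9.0), (4.0, 4.0)]
    test_labels = ["empty", "adult", "adult"]
    print(accuracy(test_latents, test_labels, train_latents, train_labels))
    print(summarize([90.0, 92.0, 94.0]))
```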
4.3 Extended Yale Face Database B
The Extended Yale Face Database B [GeBeKr01] contains images of 28 human subjects under 9 poses. For each pose and human subject, the same image is recorded under 64 illumination conditions. We considered the full-size image version instead of the cropped one and used 25 human subjects for training and 3 for testing. We removed some of the extremely dark illumination conditions (no face visible). Example images from the dataset are plotted in Fig. 10.
For the triplet sampling, we chose as a positive sample an image with the same head pose and as a negative sample an image with a different head pose. We report qualitative results of a trained model in Fig. 10. The model is able to remove lighting and shadows from the training images, but the vanilla reconstruction on test samples can be blurry. We are not using the center-cropped variant of the dataset, which makes the task more complicated, because the head is not necessarily at the same position for different human subjects. Nevertheless, the model is able to provide a nearest neighbour with a similar head pose and head position.
4.4 Webcam Clip Art
The Webcam Clip Art [lalonde-siggraph-asia-09] dataset consists of images from 54 webcams from places all over the world. The images are recorded continuously such that the same scenery is available for different times of day, seasons and weather conditions. For each of the 54 regions, we randomly selected 100 sceneries. Example images are provided in Fig. 11.
For the triplet sampling, we chose as positive sample an image from the same location and for the negative sample an image from a different location. Each landscape and building arrangement undergoes unique shadow, illumination and reflection properties. The generalization to unknown places under unknown illumination conditions is thus too demanding to be deduced from a single input image. Hence, we do not provide a test evaluation and report results on training samples only in Fig. 11. The model removes the illumination variations and shadows from the images. Moreover, rivers, oceans and skies as well as beaches are smoothed out. Most of the people and cars are removed and replaced by the actual background of the scenery.
Our proposed method works well on the training data, which can be sufficient for applications where only post-processing of a fixed dataset is needed. Since the generalization to test images is achieved by a nearest neighbour search, it will only be useful for a subset of machine learning tasks. Our method preserves the classes for a given problem formulation, which will be fine for classification and object detection. Although our method preserves even head poses (e.g. Fig. 10) when they are dominantly present in the training images, our approach will likely not preserve complex human poses (e.g. Fig. 6) or detailed facial landmarks, because the body poses and key features are not necessarily preserved by the nearest neighbour search. Future work should try to incorporate constraints such that the poses and landmarks of test samples are preserved as well.
In practice, it will be challenging to record identical sceneries under different lighting conditions. However, as the Extended Yale Face Database B [GeBeKr01] and Webcam Clip Art [lalonde-siggraph-asia-09] datasets have shown, it is feasible. Since we have highlighted the benefit of acquiring such datasets, the investment of recording under similar conditions in practice can be worthwhile for some applications. We believe that future work will develop possibilities to facilitate the data acquisition process. Moreover, the possibility to incorporate images taken of the same scene, but in less perfect conditions, should be explored (e.g. Fig. S1).
Our results show the benefit of recording identical sceneries under different lighting and environmental conditions such that unwanted features can be removed by a partially impossible reconstruction loss function, without the need for a ground truth target image. Our method works well for classification and post-processing tasks due to an enhanced nearest neighbour search induced by a triplet loss regularization in the latent space of an encoder-decoder network. We demonstrated the applicability of our proposed method, as long as the correct data (i.e. same scenery under different conditions) is available, on three different tasks and datasets. Moreover, our proposed method improves classification accuracy significantly compared to standard encoder-decoder and classification models, even when the latter were fine-tuned pre-trained models.
The first author is supported by the Luxembourg National Research Fund (FNR) under grant number 13043281. This work was partially funded by the Luxembourg Ministry of the Economy under grant number CVN 18/18/RED.