Guided Disentanglement in Generative Networks

by Fabio Pizzati, et al.

Image-to-image translation (i2i) networks suffer from entanglement effects in the presence of physics-related phenomena in the target domain (such as occlusions, fog, etc.), which lower translation quality and variability. In this paper, we present a comprehensive method for disentangling physics-based traits in the translation, guiding the learning process with neural or physical models. For the latter, we integrate adversarial estimation and genetic algorithms to correctly achieve disentanglement. Results show that our approach dramatically increases performance in many challenging scenarios for image translation.




1 Introduction

Figure 1: Guided disentanglement. Our method enables disentanglement of target scene characteristics from guidance, which may come from neural or physical models, both differentiable and non-differentiable. While naive GANs generate entangled target images, we learn a disentangled version of the scene from the guidance of a model with estimated physical or neural parameters. Notice here the unrealistic raindrop entanglement in naive GANs. In contrast, our disentangled GAN prevents entanglement, enabling the generation of the target style or of unseen scenarios.

Image-to-image (i2i) translation GANs can learn mappings in an unsupervised manner, finding great applicability in artistic style transfer, content generation, and other scenarios zhu2017unpaired; liu2017unsupervised; isola2017image. When coupled with domain adaptation strategies hoffman2017cycada; li2019bidirectional; toldo2020unsupervised, they also provide an alternative to manual annotation work for synthetic to real bi2019deep or challenging conditions generation qu2019enhanced; shen2019towards; pizzati2019domain.

Nowadays, i2i networks show great performance in learning numerous characteristics of target scenes simultaneously, which would otherwise be impractical to model with alternative approaches. However, a common pitfall is their inability to accurately learn the underlying physics of the transformation xie2018tempogan, often generating artifacts based on inaccurate mappings of source and target characteristics, which significantly impact results. This is the case, for example, when learning a clear-to-rain translation, as a naive GAN will inevitably entangle inaccurate raindrops, as highlighted in Fig. 1 (top). In this paper we show that this is attributable to the extraction of entangled representations of scene elements and physical traits. On the other hand, model-based approaches are commonly used to render well-studied elements of the target domain with great realism roser2009video; alletto2019adherent; halder2019physics; tremblay2020rain, while leaving any other domain gap unaddressed.

We propose a comprehensive learning-based framework to unify generative networks with prior-based guidance. Our proposal relies on a disentanglement strategy integrating GAN or physics model guidance into an adversarial training whose optimal parameters are regressed on the target set. Ultimately, we partially render the target scene with a neural or physical model while learning the un-modeled target characteristics with an i2i network, in a complementary manner, and we compose them as shown in Fig. 1 to obtain the final inference. Besides increasing image realism, our physical model-guided framework enables fine-grained control of physical parameters in rendered scenes, increasing the variability of generated images regardless of the training dataset. This is clearly beneficial for outdoor robotics applications, which require robustness to various unobserved scenarios.

In this paper, we significantly extend our prior work pizzati2020model, which focused on occlusion disentanglement with differentiable models, by opening to non-differentiable models (Sec. 3.3.2), introducing an entirely novel GAN-guided disentanglement pipeline with new ad-hoc experiments (Secs. 3.2, 4.3) and a new geometry-dependent task (Sec. 4.2.2), further extending the evaluation (Sec. 4.2.3), and finally extending both ablations (Sec. 4.4) and discussion (Sec. 5).

2 Related works

2.1 Image-to-image translation

The seminal work on image-to-image translation (i2i) using conditional GANs on paired images was conducted by Isola et al. isola2017image, while wang2018pix2pixHD exploits multi-scale architectures to generate HD results. Zhu et al. zhu2017unpaired propose a framework working with unpaired images introducing cycle consistency, exploited also in early work on paired multimodal image translation zhu2017toward. A similar idea is proposed in yi2017dualgan.

There has been a recent trend of alternatives to cycle consistency for appearance preservation amodio2019travelgan; benaim2017one; fu2019geometry, increasing focus on global image appearance and reducing unneeded textural preservation. In nizan2020breaking, they propose a cycle-consistency-free multi-modal framework. Many methods also include additional priors to increase translation consistency, using objects shen2019towards; bhattacharjee2020dunit, instances mo2018instagan, geometry wu2019transgaga; arar2020unsupervised or semantics li2018semantic; ramirez2018exploiting; tang2020multi; cherian2019sem; zhu2020semantically; zhu2020sean; lin2020multimodal; ma2018exemplar; lutjens2020physicsinformed. Other approaches learn a shared latent space using a Variational Autoencoder, as in Liu et al. liu2017unsupervised.


Recently, attention-based methods were proposed to partially modify input images while keeping domain-invariant regions unaltered mejjati2018unsupervised; ma2018gan; tang2019attention; kim2019u; Lin_2021_WACV. Alternatively, spatial attention was exploited to better drive the adversarial training on unrealistic regions lin2021attention. Some methods focus instead on generating intermediate representations of source and target gong2019dlow; lira2020ganhopper or continuous translations pizzati2021comogan; liu2021smoothing. In the recent gomez2020retrieval, the authors exploit similarity with retrieved images to increase translation quality.

2.2 Disentanglement in i2i

Disentangled representations of content and appearance are an emerging trend to increase the quality of i2i outputs. Recently, Park et al. park2020contrastive proposed a contrastive learning based framework to disentangle content from appearance based on patches. MUNIT huang2018multimodal, DRIT lee2019drit++ and TSIT jiang2020tsit exploit disentanglement between content and style to achieve one-to-many translations. The idea is further extended in FUNIT liu2019few to achieve few-shot learning. In COCO-FUNIT saito2020coco, style extraction is conditioned on the image content to preserve fine-grained consistency. In HiDT anokhin2020high, they exploit multi-scale style injection to reach high-definition translations, while xia2020unsupervised; lin2019exploring condition disentanglement on domain supervision. Following a different reasoning, kondo2019flow disentangles representations by enforcing orthogonality. In jia2020lipschitz, they prevent semantic entanglement by using gradient regularization.

Multi-domain i2i methods choi2018stargan; romero2019smit; anoosheh2018combogan; yang2018crossing; hui2018unsupervised; wu2019relgan; nguyen2021multi could also be exploited for disentangling representations among different domains, at the cost of requiring annotated datasets with separated physical characteristics – practically inapplicable for real images. Recent frameworks yu2019multi; choi2020starganv2 unify multi-domain and multi-target i2i by exploiting multiple disentangled representations. Some works singh2019finegan; li2021image depart from the literature by proposing hierarchical generation. In bi2019deep, instead, they learn albedo and shading separately, regardless of the general scene. A similar result is achieved by liu2020unsupervised, using only unpaired images. Recently, VAE-based alternatives have also emerged bepler2019explicitly.

Disentangled representations could also help in physics-driven i2i tasks, such as yang2018towards where a fog model is exploited to dehaze images. Similarly, Gong et al. gong2020analogical perform fog generation exploiting paired simulated data. Even though these methods effectively learn physical transformations in a disentangled manner, they simply ignore the mapping of other domain traits.

2.3 Physics-based generation

Many works in the literature rely on rendering to generate physics-based traits in images, for rain streaks garg2006photorealistic; halder2019physics; tremblay2020rain; weber2015multiscale; rousseau2006realistic, snow barnum2010analysis, fog sakaridis2018semantic; halder2019physics or others. In many cases, physical phenomena cause occlusions of the scene, which are well studied in the literature. For instance, many models for raindrops are available, exploiting surface modeling and ray tracing roser2009video; roser2010realistic; hao2019learning. In you2015adherent, raindrop motion dynamics are also modeled. Recent works instead focus on photorealism, relaxing physical accuracy constraints porav2019can; alletto2019adherent. A general model for lens occluders has been proposed in gu2009removing. Logically, it is extremely challenging to entirely simulate the appearance of a scene encompassing multiple physical phenomena (for rain: rain streaks, raindrops on the lens, reflections, etc.); hence in tremblay2020rain they also combine i2i networks and physics-based rendering. However, this is quite different from our objective, since they assume to physically model features not present in the target images. To the best of our knowledge, there is no method that unifies physical-model-based rendering and i2i translation in a complementary manner.

Figure 2: GAN-guided disentanglement. We exploit a separate frozen GAN which renders specific target traits (here, dirt) on the generator output images before forwarding them to the discriminator. This enforces disentanglement in the generator.
Figure 3: Model-guided disentanglement. Our unsupervised disentanglement process consists of applying a physical model to the generated image. Subsequently, the composite image is forwarded to the discriminator and the GAN loss is backpropagated (dashed arrows). The model rendering depends on the estimated parameters, composed of differentiable and non-differentiable ones. We use a Disentanglement Guidance (DG) to avoid interfering with gradient propagation in the learning process. Green stands for real data, red for fake.

3 Guided disentanglement

The motivation of our work is that standard i2i GANs rely solely on context mapping between source and target. In some setups, however, the target domain encompasses known visual traits partially occluding the scene, for example adverse weather or lens occlusions, thoroughly studied in the literature. Hence, it is desirable to integrate a priori models in the adversarial learning process to boost performance.

To model i2i transformations as a composition of characteristics, we propose a disentangled setting where the i2i GAN compensates for the missing (unmodeled) traits to recreate the complete target style. Disentanglement is achieved relying either on a fully neural GAN-guided setting shown in Fig. 2 (Sec. 3.2) or on a physics model-guided setting shown in Fig. 3 (Sec. 3.3). In the latter, we exploit existing physical models for disentanglement, using as the only prior the nature of the physical trait we aim to disentangle (e.g. raindrops, dirt, fog, etc.). Furthermore, we estimate the target parameters of the physical model (Sec. 3.3.1) to ease disentangled learning as well as to reduce differences with the target. Our approach boosts image quality and realism by guiding model injection during training with a gradient-based guidance (Sec. 3.3.3). An extensive explanation of the training strategies is in Sec. 4.

3.1 Adversarial disentanglement

In image-to-image translation we aim to learn a transformation between a source domain and a target domain, mapping one to the other in an unsupervised manner. We assume the target appearance to be partly characterized by a well-identified phenomenon, such as occlusions on the lens (e.g. rain, dirt) or weather phenomena (e.g. fog). Hence, we propose a sub-domain decomposition pizzati2019domain of the target, separating the identified traits from the remaining ones. We assume this decomposition only on the target side. In adversarial learning, the task of the generator is to approximate the probability distributions associated with the problem domains.
For simplicity, we assume that the traits identifiable in this manner are independent of the recorded scene. For instance, the physical properties of raindrops on a lens (such as thickness or position) do not change with the scene; the same holds for fog, whose visual effects are only depth-dependent. Therefore, the identified traits are fairly independent of the remaining scene, hence we formalize the target distribution as a joint probability distribution with independent marginals,

P(Y_m, Y_r) = P(Y_m) P(Y_r),

where Y_m denotes the modeled traits and Y_r the remaining ones. Intuitively, modeling one of the marginals with a priori knowledge will force the GAN to learn the other one in a disentangled manner. During training, this translates into injecting features belonging to the modeled sub-domain before forwarding the images to the discriminator, which will provide feedback on the overall realism of the image.

Formally, we modify an LSGAN mao2017least training, which enforces adversarial learning by minimizing

L_D = 1/2 E_{y~Y}[(D(y) - 1)^2] + 1/2 E_{x~X}[D(G(x))^2],
L_G = 1/2 E_{x~X}[(D(G(x)) - 1)^2],

where L_G and L_D are the tasks of the generator G and the discriminator D, respectively, with source domain X and target domain Y. We instead learn a disentangled mapping by injecting modeled traits M on translated images. We newly define the disentangled composition y' of the translated scene G(x) and the modeled traits M, hence

y' = α ⊙ M + (1 - α) ⊙ G(x),

where ⊙ is the element-wise product. We define α as a pixel-wise measure of blending between modeled and generated scene traits: pixels which depend only on M (such as opaque occlusions) have α = 1, while others (e.g. transparent ones) have α < 1. We detail hereafter several strategies to estimate M, relying either on a GAN or on a physical model, whether differentiable or not.
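The composition above is a plain per-pixel blend; as a concrete illustration, it can be sketched in a few lines of NumPy. The function name, shapes, and the toy mask are our own, and a real pipeline would blend full-resolution tensors inside the training loop:

```python
import numpy as np

def compose(translated, model_traits, alpha):
    """Blend the i2i output with model-rendered traits via a pixel-wise mask.

    alpha = 1 where the occlusion is opaque (only the model is visible),
    0 < alpha < 1 for transparent traits. Shapes here: (H, W, 3) images and
    an (H, W, 1) mask. Illustrative sketch only.
    """
    return alpha * model_traits + (1.0 - alpha) * translated

# toy example: a half-transparent "occlusion" in the top-left corner
scene = np.zeros((4, 4, 3))   # stand-in for the translated scene G(x)
drops = np.ones((4, 4, 3))    # stand-in for rendered model traits M
alpha = np.zeros((4, 4, 1))
alpha[:2, :2] = 0.5           # transparent occlusion region
out = compose(scene, drops, alpha)
```

During training, `out` (not the raw translation) is what reaches the discriminator, so realism feedback covers the composite image.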

3.2 GAN-guided disentanglement

Let us first consider the simple case where the modeled traits are the output of a separate GAN, which we call the guidance GAN. Following the previous explanations, assuming the guidance GAN models specific visual traits – be it dirt, drops, watermarks or else – it approximates the marginal distribution of the modeled traits. We define an optimal set of network parameters reproducing the target occlusion appearance. Subsequently, processing generated images with the guidance GAN before forwarding them to the discriminator pushes the guided-GAN (not to be confused with the guidance GAN) to achieve disentanglement, as illustrated in Fig. 2. We provide further implementation details in Sec. 4.3.

Importantly, even if the guidance GAN is trained in a supervised way – for example, from annotated pairs of images and dirt – the disentanglement strategy is itself fully unsupervised. However, this in fact matches only a rather restricted range of scenarios, because the guidance GAN needs to resemble target traits to enable disentanglement. In other words, referring to Eq. 2, the guided-GAN can only achieve disentanglement and estimate the remaining traits from target images if the guidance GAN correctly estimates the marginal of the modeled traits. Suppose the guidance GAN augments rain on images: such a GAN-guided strategy will be sensitive to the intensity as well as the appearance of its drops. Subsequently, disentanglement will fail if, for example, the guidance GAN generates in-focus drops while the target encompasses out-of-focus drops, since both have drastically different visual appearances (visible in Fig. 1, bottom).
Instead, we now introduce a model-guided disentanglement to re-inject physical traits of arbitrary appearance, greatly increasing the generative capabilities of our guided framework.

3.3 Model-guided disentanglement

(a) Differentiable parameters.
(b) Joint differentiable and non-differentiable strategy.
Figure 4: Model-guided parameter estimation. (a) We exploit a pretrained discriminator to calculate an adversarial loss on source data augmented with the model having differentiable parameters. In this process, the gradient flows only in the direction of the differentiable parameters. (b) We optimize differentiable (blue) and non-differentiable (purple) parameters until convergence, alternately reaching new minima used during the optimization of the other parameter set. While differentiable parameters are regressed (Sec. 3.3.1), non-differentiable ones require black-box genetic optimization (Sec. 3.3.2), here CMA-ES hansen2003reducing.

Alternatively, one can formalize guidance as a physical model from the existing literature – typically rendering visual traits like drops, fog, or else. Injecting such physical models in our guided-GAN enables disentanglement of the visual traits not rendered by the physical model, like wet materials for rain models halder2019physics, clouds in the sky for fog models sakaridis2018semantic, etc.

Because physical models may have extremely variable appearance depending on their parameters, we propose adversarial-based strategies to regress the optimal parameters mimicking the target dataset appearance. This is in fact needed for disentangled training, where we assume modeled traits to resemble target ones. Other parameters are of a stochastic nature (e.g. drop positions on the image) and are encoded as a noise input regulating random characteristics. Additionally, some model appearances – like refractive occlusions – vary with the underlying scene (in Sec. 5 we explain why a model depending on the scene does not violate the independence assumption of Eq. 2, and we evaluate its effect in Sec. 4.4). Following our model-guided pipeline in Fig. 3, if the estimation properly recovers the target physical parameters, the model estimates the marginal of the modeled traits, which again enables disentanglement.

During inference, instead, the parameters and the noise can be varied arbitrarily, greatly increasing generation variability while still obtaining realistic target scene renderings. In the following, we describe our adversarial parameter estimation strategy, distinguishing differentiable and non-differentiable parameters.

3.3.1 Adversarial differentiable parameters estimation

To estimate the optimal differentiable parameters for the target, we exploit an adversarial strategy benefiting from the entanglement of naive trainings. Consider a naive baseline trained on the source-to-target mapping, where the target entangles two sub-domains: the entangled discriminator successfully learns to distinguish fake target images, and is thus able to discriminate images lacking the modeled traits from full target images. In a simplified scenario where the scene content is arbitrarily confused with the source domain, this domain confusion prevents any change in the scene itself, so regressing the parameters of the injected differentiable model is the only way to minimize the domain shift. To reduce the differences between source and target, the network is left with updating the injected physical model appearance, ultimately regressing the optimal differentiable parameters.

Fig. 3(a) shows our differentiable parameter pipeline. From a training perspective, we first pretrain an i2i baseline (e.g. MUNIT huang2018multimodal), learning an entangled mapping. We then freeze the entangled discriminator D and use it to solve

w̃_d = argmin_{w_d} E_{x~X}[(D(x ⊕ M_w(x)) - 1)^2],

backpropagating the GAN loss through the differentiable model M_w injected on the source image (⊕ denotes the blended composition of Sec. 3.1). Since many models encompass pixel-wise transparency, the blending mask often directly encodes it. Freezing the discriminator is mandatory, as we aim to preserve the previously learned target domain appearance during the estimation process. After convergence, we extract the optimal parameter set. Alternatively, the parameters could be manually tuned by an operator, at the cost of menial work and inaccuracy, possibly leading to errors in the disentanglement.
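The mechanics of this estimation can be illustrated with a deliberately tiny sketch: a frozen, pretrained critic scores composites, and only the model parameter (here a hypothetical drop-blur sigma) is updated. The quadratic critic and the target value are mocks of ours; in the paper the loss is the LSGAN feedback of the entangled discriminator, backpropagated through the differentiable renderer:

```python
# Assumed "true" target blur, unknown to the optimizer in this toy setup.
TARGET_SIGMA = 2.0

def frozen_discriminator_loss(sigma):
    # Mock LSGAN generator-side loss (D(x') - 1)^2, minimized when the
    # rendered occlusions match the target appearance.
    return (sigma - TARGET_SIGMA) ** 2

def estimate_differentiable_params(sigma0, lr=0.1, steps=200, eps=1e-4):
    """Gradient descent against a frozen critic. Central finite differences
    stand in for backpropagation to keep the sketch framework-free."""
    sigma = sigma0
    for _ in range(steps):
        grad = (frozen_discriminator_loss(sigma + eps)
                - frozen_discriminator_loss(sigma - eps)) / (2 * eps)
        sigma -= lr * grad
    return sigma

sigma_opt = estimate_differentiable_params(sigma0=0.5)
```

Swapping the mock for a real discriminator and renderer only changes how the loss and gradient are computed; keeping the critic frozen is what pins the previously learned target appearance.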

From Fig. 3(a), notice that the gradient flows only through the differentiable parameters. We now detail our strategy to jointly optimize the inevitable non-differentiable parameters.

3.3.2 Genetic non-differentiable parameters estimation

The previously described strategy only holds for differentiable parameters, since it relies on backpropagating an adversarial loss. Nonetheless, many models include non-differentiable parameters that equally impact the realism of the model. For example, a model generating raindrop occlusions would include differentiable parameters like the imaging focus, but also non-differentiable ones like the shape or number of drops – all of which significantly impact visual appearance. Incorrectly setting non-differentiable parameters can lead to a wrong disentanglement, and manually approximating them via trial and error may be cumbersome or impractical for vast search spaces. To circumvent this, we exploit a genetic strategy to estimate them.

In our method, non-differentiable parameters are fed to a genetic optimization strategy. The fitness criterion remains the same as for differentiable parameters, namely the adversarial loss of the pretrained discriminator. In practice, to avoid noisy updates, we average the adversarial loss over a fixed number of samples to reliably select a new population. After convergence, we extract the optimal parameter set. In our experiments, we use CMA-ES hansen2003reducing as the evolutionary strategy, but the proposed pipeline is extensible to any other genetic algorithm.
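The black-box loop can be sketched with a minimal (mu, lambda)-style evolution strategy, a simplified stand-in of ours for the CMA-ES used in the paper. The fitness function mocks the averaged adversarial loss, and the hypothetical non-differentiable parameters (a "drop count" and "shape id") are assumed optimal at (30, 2):

```python
import numpy as np

def adversarial_fitness(params, rng):
    # Mock of the frozen discriminator's adversarial loss on source images
    # augmented with the model; averaged over several noisy samples to
    # reduce update noise, as done in the paper.
    target = np.array([30.0, 2.0])  # assumed optimum, unknown to the search
    losses = [np.sum((params - target) ** 2) + rng.normal(scale=0.1)
              for _ in range(8)]
    return float(np.mean(losses))

def evolve(dim=2, pop=16, elite=4, sigma=3.0, gens=60, seed=0):
    """Sample a population around the mean, keep the fittest candidates,
    and recenter; sigma decays on a fixed schedule (CMA-ES adapts the
    full covariance instead)."""
    rng = np.random.default_rng(seed)
    mean = np.zeros(dim)
    for _ in range(gens):
        candidates = mean + rng.normal(scale=sigma, size=(pop, dim))
        scores = np.array([adversarial_fitness(c, rng) for c in candidates])
        mean = candidates[np.argsort(scores)[:elite]].mean(axis=0)
        sigma *= 0.95  # simple step-size decay
    return mean

w_nd = evolve()  # ends near the assumed optimum
```

In the joint setting of Fig. 3(b), such an evolutionary update would simply alternate with the gradient-based regression of the differentiable parameters.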

Joint adversarial and genetic optimization. For models having both differentiable and non-differentiable parameters, we employ the joint optimization shown in Fig. 3(b). We first initialize a set of parameters, then alternately use our adversarial strategy (Sec. 3.3.1) for the differentiable parameters and the genetic strategy for the non-differentiable ones. Note that this alternation prevents divergence due to simultaneous optimization. We apply updates until the optimum is reached, obtaining the two sets of target-style parameters.

3.3.3 Disentanglement guidance

It is worth noting that too sparse an injection of the model negatively impacts disentanglement, because the guided-GAN will entangle similar physical traits to fool the discriminator, while injecting too much of it will prevent the discovery of the disentangled target. Spatially, we observe that regions which do not differ from source to target are the most frequently impacted by entanglement. This is because the discriminator naturally provides less reliable predictions due to the local source-target similarities, which leads the generator to produce artifacts resembling target physical characteristics to fool the discriminator, eventually causing unwanted entanglement. In rainy scenes this happens for trees or buildings, whose appearance varies little whether dry or wet, whereas ground or road exhibit puddles, which are strong rainy cues.

To balance the injection of the model, we guide disentanglement by injecting it only in areas of low domain shift, pushing the guided-GAN to learn the disentangled mapping of the scene. Specifically, we learn a Disentanglement Guidance (DG) map dataset-wise by averaging the GradCAM selvaraju2017grad feedback on the source dataset, relying on the discriminator gradient for the fake classification. Areas with high domain shift will be easily identified as fake, while others will impact the prediction less. To take different resolutions into account, we evaluate GradCAM for all the discriminator layers and average the resulting maps. At training time, we inject the model only on pixels where the guidance is below a threshold hyperparameter. In Sec. 4.4 we visually assess the effect of DG.

4 Experiments

We thoroughly evaluate our guided disentanglement proposal on the real datasets nuScenes caesar2019nuscenes, RobotCar porav2019can, Cityscapes cordts2016cityscapes and WoodScape yogamani2019woodscape, and on the synthetic Synthia ros2016synthia and Weather Cityscapes halder2019physics. The training setup is detailed in Sec. 4.1.1. We extensively test the model-guided strategy in Sec. 4.2 on the disentanglement of raindrops, dirt, composite occlusions, and fog, relying on simple physics models. The GAN-guided strategy, which requires a rarely available separate GAN rendering occlusion traits, is subsequently evaluated in Sec. 4.3 on dirt disentanglement, relying on the recent DirtyGAN uricar2019let. All experiments are evaluated both qualitatively and quantitatively, relying on GAN metrics as well as proxy tasks. Our method is compared against the recent DRIT lee2019drit++, U-GAT-IT kim2019u, AttentionGAN tang2019attention, CycleGAN zhu2017unpaired, and MUNIT huang2018multimodal frameworks.

Unlike the literature, our method enables disentanglement of the target domain, so we report both the disentangled translations and the translations with injection of the optimal target occlusions. The disentanglement is clearly visible in the still frames of this section, but is also beneficial in video - without any temporal constraints - for our model-guided output (cf. supplementary video). In Sec. 4.2.3, we study the accuracy of our model parameter estimation on the well-documented raindrop model, and we ablate our proposal in Sec. 4.4.

4.1 Methodology

Our model-/GAN-guided GAN is architecture-agnostic. Here, we rely on a MUNIT huang2018multimodal backbone for its multi-modal capabilities, and exploit LSGAN mao2017least for training.

For clarity, we formalize trainings by indicating whether they use GAN-guided disentanglement or model-guided disentanglement, the latter either with differentiable parameters only or with the full model. When re-injecting occlusions, we also report their parameters in parentheses; for example, a model-guided disentangled output may be reported with the re-injection of occlusions using the full model estimated on the target.

4.1.1 Training

Figure 5: Training pipelines. For model-guided disentanglement, we 1) train a naive entangled i2i baseline, 2) use the entangled discriminator feedback to estimate the optimal parameters, 3) extract the Disentanglement Guidance (DG), and finally 4) train the guided-GAN with model injection. For GAN-guided disentanglement, we 1) train a guidance GAN exploiting additional knowledge such as available semantic annotations, and 2) use it to inject target traits during our guided-GAN training.

Fig. 5 shows our two training pipelines.
For model-guided training (Fig. 5, top), we leverage a multi-step pipeline, only assuming the nature of the features to disentangle to be known (e.g. raindrop, dirt, fog, etc.). First, an i2i baseline is trained in an entangled manner, obtaining an entangled discriminator. Second, we use this discriminator to regress the optimal parameters with adversarial (Sec. 3.3.1) and genetic (Sec. 3.3.2) estimation. Third, we extract the Disentanglement Guidance (Sec. 3.3.3), also using the entangled discriminator. Finally, we train the disentangled guided-GAN from scratch (Sec. 3.3).
For GAN-guided training (Fig. 5, bottom), we use a prior-agnostic two-step pipeline. First, we train the guidance GAN to model the target elements with adversarial learning, exploiting semantic supervision in our experiments, though it could realistically be replaced with self-supervision. Then, we train the disentangled guided-GAN without any supervision.

4.1.2 Tasks

Task          Entanglement           Datasets                                              Guidance

Model-guided
  Raindrop    Raindrops              nuScenes caesar2019nuscenes                           Raindrop model
  Dirt        Dirt                   WoodScape yogamani2019woodscape                       Dirt model
  Fog         Fog                    Synthia ros2016synthia, Weather CS halder2019physics  Fog model
  Composite   Composite occlusions   Synthia ros2016synthia                                Composite model

GAN-guided
  Dirt        Dirt                   WoodScape yogamani2019woodscape                       DirtyGAN uricar2019let

Table 1: Disentanglement tasks. For each task, we indicate the features entangled in the target domain (also shortened as indices of the task name), the datasets, and the model or GAN guidance employed for disentanglement.

Tab. 1 lists the evaluated tasks and the ad-hoc datasets. When referring to a task, we denote as indices the features entangled in the target domain; for instance, the raindrop task translates from a clear domain to a rain domain having entangled drops. We describe below the models used for disentanglement.

We exploit the recent nuScenes caesar2019nuscenes dataset, which includes urban driving scenes, and use its metadata to build clear/rain splits, obtaining 114251/29463 training and 25798/5637 testing clear/rain images. Target rain images entangle highly unfocused drops on the windshield, which could hardly be annotated, as seen in Fig. 8, first row.

Here, we rely on the recent fish-eye WoodScape yogamani2019woodscape dataset, which has some images with soiling on the lens. We separate the dataset into clean/dirty images using the soiling metadata, getting 5117/4873 training images and 500/500 for validation. Because the clean/dirty splits do not encompass other domain shifts, we additionally transform clean images to grayscale. Subsequently, we frame this as a colorization task where the target color domain entangles dirt. For disentanglement, we experiment with both a physics model-guided and a GAN-guided strategy.

With Synthia ros2016synthia we also investigate the entanglement of very different alpha-blended composites, like "Confidential" watermarks or fences. We split Synthia using metadata into clear/snow images and further augment the snow target with said composites at random positions. As clear/snow splits, we use 3634/3739 images for training and 901/947 for validation. To guide disentanglement, we consider a composite model inspired by the concept of thin occluders garg2006photorealistic.

We learn here the mapping from synthetic Synthia ros2016synthia to the foggy version of Weather Cityscapes halder2019physics – a fog-augmented Cityscapes cordts2016cityscapes. The goal is to learn the synthetic-to-real mapping while disentangling the complex fog effect in the target. For training we use 3634/11900 Synthia/Weather Cityscapes images, and 901/2000 for validation. We use a fog model to guide our network.
Note that this task differs from the others, since the target has fog of heterogeneous intensities (max. visibility 750, 375, 150 and 75 m), making disentanglement significantly harder.

4.2 Model-guided disentanglement

We first detail the models used for guidance (Sec. 4.2.1) in the model tasks of Tab. 1, and then evaluate the disentanglement performance. Since non-differentiable parameters were fairly easy to tune manually, we thoroughly experiment in the differentiable-only setup (Sec. 4.2.2), and later compare it to our full estimation (Sec. 4.2.3).

4.2.1 Guidance

To correctly fool the discriminator, it is important to choose a model that realistically resembles the entangled feature. We now detail the four models implemented for the model tasks in Tab. 1, specifying differentiable and non-differentiable parameters. An ablation of the models is presented in Sec. 4.4.

Raindrop model. Fig. 6 illustrates our drop occlusion model extending the model of Alletto et al. alletto2019adherent, which balances complexity and realism. Drops are approximated by simple trigonometric functions, and we also add noise for shape variability shadertoy. For drop photometry, we use fixed displacement maps for coordinate mapping on both the x and y axes, technically encoded as 3-channel images alletto2019adherent. To approximate light refraction, a drop pixel at $(x, y)$ is mapped to

$(x, y) \mapsto \big(x + w\,U(x, y),\; y + w\,V(x, y)\big)\,,$

where $U, V$ are the displacement maps and $w$ is a drop-wise value representing water thickness. Most importantly, we also model imaging focus, since it may strongly impact the rendered raindrop appearance halimeh2009raindrop; cord2011towards; alletto2019adherent. Hence, we use a Gaussian point spread function pentland1987new

$g(x, y) = \frac{1}{2\pi\sigma^2}\exp\!\left(-\frac{x^2 + y^2}{2\sigma^2}\right)$

to blur synthetic raindrops. We implement the kernel variance $\sigma^2$ as differentiable, while drop size, frequency, and shape-related parameters are non-differentiable. We use a single shape parameter and generate 4 types of drops, each with its own associated parameters.
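As an illustration, the refraction-plus-defocus rendering described above can be sketched as follows. This is a minimal NumPy sketch under our own assumptions (function and parameter names are ours; the actual model additionally handles trigonometric drop shapes and noise):

```python
import numpy as np

def gaussian_kernel1d(sigma, radius=None):
    """1D Gaussian kernel used as a separable point spread function."""
    radius = radius or max(1, int(3 * sigma))
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x**2 / (2 * sigma**2))
    return k / k.sum()

def gaussian_blur(img, sigma):
    """Separable Gaussian blur (rows, then columns) on a grayscale image."""
    k = gaussian_kernel1d(sigma)
    blurred = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1, img)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="same"), 0, blurred)

def render_drop(image, drop_mask, disp_x, disp_y, thickness, sigma):
    """Refract pixels under a drop via displacement maps scaled by water
    thickness, then blur with a Gaussian PSF of variance sigma^2."""
    h, w = image.shape
    ys, xs = np.mgrid[0:h, 0:w]
    src_x = np.clip(np.round(xs + thickness * disp_x).astype(int), 0, w - 1)
    src_y = np.clip(np.round(ys + thickness * disp_y).astype(int), 0, h - 1)
    refracted = gaussian_blur(image[src_y, src_x], sigma)
    out = image.copy()
    out[drop_mask] = refracted[drop_mask]   # composite only under the drop
    return out
```

Here `thickness` plays the role of the drop-wise water thickness and `sigma` the differentiable defocus blur, while drop size, frequency and shape would be handled by the non-differentiable part of the model.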

Dirt model. Here, we extend our raindrop model by removing the displacement maps, as soil has no refractive behavior. Instead, we introduce a color guidance that forces synthetic dirt to be brighter in peripheral regions, depending on a parameter which regulates the occlusion's maximum opacity (hence, its maximum value). We also estimate the defocus blur as aforementioned. Sample outputs are shown in Fig. 7.

Composite occlusions model. We exploit the thin occluder model proposed in garg2006photorealistic to render composite occlusions on images, i.e. randomly translated, alpha-blended transparent images such as watermarks or fence-like grids. We assume full knowledge of the transparency, thus no parameter is learned.
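The alpha-blended compositing can be sketched as follows (a minimal illustration with our own naming; the actual thin-occluder model of garg2006photorealistic is more general):

```python
import numpy as np

def composite_occlusion(image, occluder_rgba, top_left):
    """Alpha-blend a transparent occluder (e.g. watermark or fence grid)
    onto the scene at a given position:
    out = alpha * occluder + (1 - alpha) * scene."""
    out = image.astype(float).copy()
    y0, x0 = top_left
    oh, ow = occluder_rgba.shape[:2]
    region = out[y0:y0 + oh, x0:x0 + ow]
    alpha = occluder_rgba[..., 3:4]           # known transparency, not learned
    region[...] = alpha * occluder_rgba[..., :3] + (1.0 - alpha) * region
    return out
```

Random translation amounts to sampling `top_left` uniformly within the valid image area.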

Fog model. We leverage the physical model of halder2019physics using an input depth map. Fog thickness is regulated by a differentiable extinction coefficient, which determines the maximum visibility.
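For reference, a sketch of the standard optical extinction model (Koschmieder's law) that underlies such depth-dependent fog rendering. This is a simplified illustration under our assumptions; the full model of halder2019physics handles airlight and camera effects more carefully:

```python
import numpy as np

def apply_fog(image, depth, beta, airlight=1.0):
    """I_fog = I * exp(-beta*d) + A * (1 - exp(-beta*d)), where beta is
    the (differentiable) extinction coefficient and d the scene depth."""
    t = np.exp(-beta * np.asarray(depth, dtype=float))
    if image.ndim == 3:                     # broadcast over color channels
        t = t[..., None]
    return image * t + airlight * (1.0 - t)

def max_visibility(beta, contrast_threshold=0.05):
    """Meteorological visibility: distance at which contrast drops to 5%."""
    return -np.log(contrast_threshold) / beta
```

Under this convention, beta ≈ 0.04 corresponds to roughly 75 m visibility, i.e. the thickest fog intensity in Weather Cityscapes.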

Figure 6: Raindrop model. We extend the model of alletto2019adherent with displacement maps of variable shape and size (left) to model light refraction, and apply a Gaussian blur kernel with variance σ² to render out-of-focus appearance (right).
Figure 7: Dirt model. Sample occlusions of our model on WoodScape yogamani2019woodscape show that the rendered color depends on the soiling transparency. The full degrees of freedom of our dirt model are transparency and defocus blur.

Figure 8: Raindrop disentanglement. We compare qualitatively with the state-of-the-art on the task with model-based raindrop disentanglement. In the first row, we report samples of the target domain. Subsequently, the source image (2nd row), the translations by different baselines (rows 3-7) and our results (rows 8-11). Our model-guided network is able to disentangle the generation of peculiar rainy characteristics from the drops on the windshield (‘Disentangled’ row). In the last rows, we re-inject droplets with the estimated parameters representing the target style (‘Target-style’ row) or other arbitrary styles (last 2 rows).
Figure 9: Dirt disentanglement. We compare with MUNIT huang2018multimodal on the task. Although MUNIT successfully mimics the target style (rows 1,3), both model-guided and GAN-guided approaches lead to a more realistic image colorization, disentangling the presence of dirt (‘Disentangled’ rows). Furthermore, we compare dirt generation on the lens with both of our strategies (‘Target-style’ rows).

4.2.2 Disentanglement evaluation

In this section, we separate the experiments on Raindrop, Dirt and Composite disentanglement from the Fog experiments, since only the former have homogeneous physical parameters throughout the dataset. (For Raindrop, Dirt and Composite we consider the physical parameters to be dataset-wise constant, e.g. all raindrops have the same defocus blur, transparency, etc. Conversely, Fog images have varying fog intensity.)

Qualitative disentanglement.

We present different outputs for the network trained on nuScenes caesar2019nuscenes, comparing to state-of-the-art methods lee2019drit++; kim2019u; tang2019attention; zhu2017unpaired; huang2018multimodal (Fig. 8), and report results with respect to the backbone for the other tasks (Figs. 9 and 10, respectively). In all cases, baselines entangle occlusions in different manners. For instance, in Fig. 8 the constant position of rendered raindrops across frameworks is noticeable, as in the 4th column on the leftmost tree, which is a visible effect of entanglement and limits image variability. Occlusion entanglement can also cause very unrealistic outputs where the structural consistency of either the scene (Fig. 9) or the occlusion (Fig. 10) is completely lost.

Referring to Figs. 8, 9 and 10, our method is always able to produce high quality images without occlusions (‘Disentangled’ rows), including typical target domain traits such as wet appearance without drops, colored images without dirt, or snowy images without occlusions, respectively. Furthermore, we can inject occlusions with the optimal estimated parameters (‘Target-style’ rows) to mimic the target appearance, which enables a fair comparison with baselines (for comparison with supervised methods, see Sec. 4.3).

We also inject raindrops with arbitrary parameters to simulate unseen dashcam-style images in Fig. 8 (last 2 rows). The realistic results demonstrate both the quality of our disentanglement and the realism of the Raindrop model.

Experiment Network IS LPIPS CIS
Raindrop CycleGAN zhu2017unpaired 1.15 0.473 -
Raindrop AttentionGAN tang2019attention 1.41 0.464 -
Raindrop U-GAT-IT kim2019u 1.04 0.489 -
Raindrop DRIT lee2019drit++ 1.19 0.492 1.12
Raindrop MUNIT huang2018multimodal 1.21 0.495 1.03
Raindrop Model-guided 1.53 0.515 1.15
Dirt MUNIT huang2018multimodal 1.06 0.656 1.08
Dirt Model-guided 1.25 0.590 1.15
Dirt GAN-guided 1.58 0.663 1.47
Composite (fence) MUNIT huang2018multimodal 1.26 0.547 1.11
Composite (fence) Model-guided 1.31 0.539 1.19
Composite (WMK) MUNIT huang2018multimodal 1.17 0.567 1.01
Composite (WMK) Model-guided 1.19 0.551 1.02
Fog MUNIT huang2018multimodal 1.22 0.429 1.13
Fog Model-guided 1.33 0.420 1.17
(a) GAN metrics.
Method AP
Original (from halder2019physics) 18.7
Finetuned w/ Halder et al. halder2019physics 25.6
Finetuned w/ Model-guided 27.7
(b) Semantic segmentation
Method SSIM PSNR
MUNIT huang2018multimodal 0.414 13.4
Model-guided 0.755 20.2
GAN-guided 0.724 19.3
(c) Colorization
Table 2: Disentanglement evaluation. In (a), we report GAN metrics for all tasks. While quality-aware metrics always increase, LPIPS depends on the visual complexity of the model and the presence of artifacts. In (b), we use our pipeline to finetune a semantic segmentation network, outperforming the state-of-the-art for rain generation. Finally, in (c), we compare our supervised and unsupervised pipelines for unpaired colorization, both outperforming the MUNIT huang2018multimodal baseline.
Figure 10: Composite disentanglement. We extend the applicability of our method to composite occlusions, which we validate in the snow scenario. We add a fence-like occlusion (left) and a confidential watermark (right) to synthetic snow images, at random positions. As expected, MUNIT exhibits entanglement phenomena, while our model-guided network successfully learns the disentangled appearance (‘Disentangled’ row). In the ‘Target-style’ row, we inject the occlusions to mimic the target style.

Figure 11: Fog translations. As visible, MUNIT shows entanglement phenomena, leading to artifacts. Our model-guided disentanglement instead enables the generation of a wide range of foggy images with arbitrary visibility, while maintaining realism. Since the fog model always blocks gradient propagation in the sky region, the network cannot achieve photorealistic disentanglement, but it still improves the generated image quality.
Source Target Porav et al. porav2019can
(a) Sample images
Method FID LPIPS
Porav et al. porav2019can 207.34 0.53
Model-guided (differentiable only) 135.32 0.44
Model-guided (full estimation) 157.44 0.43
(b) Benchmark on porav2019can
(c) FID for varying defocus blur values
Figure 12: Realism of the injected occlusion. Our defocus blur estimation yields increased realism in raindrop rendering on the RobotCar porav2019can dataset (a), compared with Porav et al. porav2019can. This is confirmed by quantitative metrics (b). We report our model-guided translations using either differentiable parameter estimation only or the full model estimation, outperforming Porav et al. porav2019can in both cases. In (c), we evaluate the FID for different defocus blur values in [0, 10], showing that our regressed value leads to a local minimum.

(a) Raindrop: defocus blur
(b) Dirt: transparency
(c) Fog: optical extinction
Figure 13: Evaluation of the model parameter regression. The reliability of our parameter estimation is assessed on synthetic datasets augmented with arbitrary physical models acting as ground truth. Comparing ground truth against our regressed values, our strategy performs better when small modifications of the estimated value correspond to large visual changes (average error is 0.99% for raindrops (a), 3.55% for dirt (b)). For fog (c), we get a higher error of 23.51% due to the low visual impact of high extinction values.
Quantitative disentanglement.

We use GAN metrics to quantify the quality of the learned mappings. Results are reported in Tab. 2(a), where the Inception Score (IS) salimans2016improved evaluates quality and diversity against the target, the LPIPS distance zhang2018unreasonable evaluates translation diversity (thus detecting mode collapse), and the Conditional Inception Score (CIS) huang2018multimodal evaluates the diversity of single-image translations for multi-modal baselines. In practice, IS is computed over the whole validation set while CIS is estimated on 100 different translations of 100 random images following huang2018multimodal. The InceptionV3 network for the Inception Scores was finetuned on source/target classification as in huang2018multimodal. The LPIPS distance is calculated on 1900 random pairs from 100 translations as in huang2018multimodal. For fairness, we only compare our ‘Target-style’ output to the baselines, since those are not supposed to disentangle occlusions and can only output occlusions resembling the target.
Tab. 2(a) shows we outperform all baselines on IS/CIS, including MUNIT, our i2i backbone. This is due to disentanglement, since entanglement phenomena limit the variability of occlusion appearance and position. Even the scene translation quality is improved by disentanglement, since the generator learns a simpler target domain mapping without any occlusions. Regarding the LPIPS distance, we outperform the baseline on raindrops while ranking lower on the other tasks. While IS/CIS quantify both quality and diversity, the LPIPS metric evaluates variability only, thus penalizing simpler occlusion generation. For instance, our rendered dirt in Fig. 9 is often black while MUNIT-generated artifacts are highly variable. The same happens for watermarks in Fig. 10, where unrealistic artifacts are highly variable. For raindrops, instead, MUNIT tends to just blur images, while we benefit from the refractive capabilities of our model, which increase LPIPS.
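The IS computation described above can be sketched as follows (a minimal illustration on precomputed class posteriors; CIS applies the same formula per source image over its translations and then averages):

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """IS = exp( E_x[ KL(p(y|x) || p(y)) ] ), where p(y|x) are classifier
    posteriors, one row per generated image, and p(y) is their mean."""
    probs = np.asarray(probs, dtype=float) + eps
    marginal = probs.mean(axis=0)
    kl = (probs * (np.log(probs) - np.log(marginal))).sum(axis=1)
    return float(np.exp(kl.mean()))
```

A score of 1 means all images receive the same posterior (no diversity or no confidence); confident, diverse predictions push the score toward the number of classes.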

Semantic segmentation. To provide additional insights on the effectiveness of our framework and to compensate for the well-known noisiness of GAN metrics zhang2018unreasonable, we quantify the usability of the generated images for semantic segmentation. We process the popular Cityscapes cordts2016cityscapes dataset with our model-guided training, obtaining a synthetic rainy version that we use for finetuning PSPNet zhao2017pyramid, following Halder et al. halder2019physics. Note that this also demonstrates the generalization capabilities of our GAN to new scenarios, since we use the network pretrained on nuScenes given the absence of rainy scenes in Cityscapes. We report the AP for the 25 rainy images with semantic labels provided by halder2019physics in Tab. 2(b). We observe a significant increase with respect to the baseline PSPNet trained on original clear images (Original), and also outperform the finetuning with physics-based rain rendering halder2019physics. Both networks finetune the Original weights. The overall low numbers are impacted by the significant domain shift between Cityscapes and nuScenes.

Disentanglement on heterogeneous datasets.

We now evaluate the effectiveness of the experiment which translates from synthetic Synthia to the real-augmented Weather Cityscapes halder2019physics, entangling fog of various intensities (from light to thick fog). Notice this task significantly differs from the others for two reasons. First, unlike in the other experiments, the model parameter (the optical extinction coefficient) varies in the target dataset. Second, the fog model depends on the scene geometry narasimhan2002vision. This makes the disentanglement task non-trivial. In our adversarial disentanglement, we however still regress a single extinction value, somehow averaging the ground-truth values.

In Fig. 11, results show we are able to generate images stylistically similar to the target ones, but with geometrical consistency and varying fog intensity (last 3 rows). Instead, MUNIT huang2018multimodal fails to preserve realism due to entanglement artifacts, visible in particular on distant elements (such as buildings in the background). Note that we intentionally do not show the disentangled output for fairness, since the physical model always blocks gradient propagation in the sky. More details are discussed in Sec. 5. Randomizing the extinction coefficient, we report GAN metrics in Tab. 2(a), where the increased image quality is quantified. The LPIPS distance suffers from the absence of artifacts in our model-guided output, since artifacts artificially increase image variability. The physical model always renders distant regions correctly (e.g. the sky, which is always occluded), hence the pure variability quantified by LPIPS is reduced (cf. the LPIPS definition above).

4.2.3 Adversarial parameters estimation

We now evaluate the effectiveness of our parameter estimation, considering only differentiable parameters first and later extending to our full system.

Differentiable model. To evaluate realism, we leverage the RobotCar porav2019can dataset, which has pairs of clear/raindrop images. Since there is no domain shift between image pairs, we regress only the defocus blur, again following Sec. 3.3.1. The regressed value is used to render raindrops on clear images. Using FID and LPIPS distances, we measure the perceived distance between real raindrop images and either our model-guided raindrop translations or those of Porav et al. porav2019can. Fig. 12(b) shows we greatly boost similarity with real raindrop images (note that unlike in previous experiments, here LPIPS is used for distance estimation, not diversity, so lower is better). This is qualitatively verified in Fig. 12(a), where our rendered raindrops are more similar to the target. To provide insights about the quality of our minima, we also evaluate the FID for arbitrary defocus blur values. Fig. 12(c) shows that our estimated sigma best minimizes the perceptual distance despite the weak discriminator signal.

To measure the accuracy of our differentiable parameter regression (Sec. 3.3.1), we need paired non-occluded/occluded images having occlusions with known physical parameters. To the best of our knowledge such a dataset does not exist. Instead, we augment RobotCar porav2019can, WoodScape yogamani2019woodscape and Synthia ros2016synthia with synthetic raindrops, dirt, and fog, respectively, with gradually increasing values of defocus blur for raindrops, transparency for dirt (in this experiment, we consider dirt with a fixed defocus blur value and regress only the transparency, to increase the diversity of tasks) and optical thickness for fog. Using each augmented dataset, we then regress said parameters following Sec. 3.3.1.

Plots in Fig. 13 show the estimation versus the ground truth. On average, the estimation error is 0.99% for raindrops, 3.55% for dirt, and 23.51% for fog. The very low error for raindrops is to be imputed to the defocus blur drastically changing the scene appearance, while the higher error for fog must be imputed to the logarithmic dependency of the fog model. Nevertheless, translations preserve realism (cf. Fig. 11).

Full model. To evaluate the quality of our full raindrop model, we this time incorporate the non-differentiable parameters (i.e. drop size, frequency and shape), which are estimated with our genetic strategy of Sec. 3.3.2 for 4 types of drops, with a genetic population size of 10. As shown in Fig. 12(b), the LPIPS metric privileges our full model-guided estimation, while the FID suffers compared to using differentiable parameters only. However, we still significantly outperform porav2019can qualitatively as well (Fig. 12(a)). The mixed results are explained by the much more complex optimization problem having many more parameters, and by the limited computation time for genetic iterations. However, this lets us foresee applications in high-dimensionality problems where manual approximation is not always possible, or with a less accurate model (see ablations in Sec. 4.4).
Results on the rain task in Fig. 14 are coherent with the above insights, as the full model estimation, although effective, exhibits slightly lower quality disentanglement.

Figure 14: Full model estimation. With complete parameter estimation (rightmost), we achieve slightly worse disentanglement than with manually-tuned non-differentiable parameters, visible in the red-highlighted areas. However, in both of our translations we generate typical rain traits such as reflections with reasonable disentanglement, while the baseline MUNIT huang2018multimodal exhibits very evident entangled raindrops, highlighted in red.

4.3 GAN-guided disentanglement

Differently from model guidance, we now evaluate our method relying on our GAN-guided strategy for dirt disentanglement.

4.3.1 Guidance

For guidance, we exploit DirtyGAN uricar2019let, a GAN-based framework for opaque soiling occlusion generation. It is composed of two components: a VAE for occlusion map generation (trained using soiling semantic maps), and an i2i network conditioned on the generated map to add synthetic soiling to images. To train DirtyGAN, we first train the VAE to learn the shape of soiling, and then train a modified CycleGAN zhu2017unpaired to generate realistic soiling, conditioning the soiling shape on the VAE outputs. For more details we refer to uricar2019let.

4.3.2 Disentanglement evaluation

We leverage here the WoodScape yogamani2019woodscape dataset, which has soiling semantic annotations as polygons. Following the GAN-guided pipeline (Fig. 2), DirtyGAN uricar2019let is trained beforehand and frozen during disentanglement.

The last 2 rows of Fig. 9 show that our GAN-guided strategy produces high quality colored images without occlusions (‘Disentangled’ row), while the injection of occlusions with optimal estimated parameters (‘Target-style’ row) also mimics the target appearance with increased variability compared to MUNIT. The use of annotations boosts the overall quality and diversity, as proved in Tab. 2(a), where our GAN-guided network outperforms both the MUNIT baseline and our own model-guided version. Furthermore, since the ground truth for colorization is available, we evaluate in Tab. 2(c) the effectiveness of disentanglement with the SSIM and PSNR metrics (higher is better). While both guided disentanglement variants outperform MUNIT huang2018multimodal very significantly, it is noticeable that GAN-guided performs worse than model-guided. Arguably, we attribute this to worse gradient propagation due to more occluded pixels with respect to our physical model (on average, DirtyGAN dirt covers 25.4% of the image while our physical model covers 20.1%; while this provides more realistic dirt masks, as the ground truth annotation is 29.6%, we conjecture it leads to worse gradient propagation).

In the last rows of Fig. 9, we visually compare our model-/GAN-guided disentanglement. Note that even if the GAN-guided variant better captures the variability, shape and color of the target occlusions, the model-guided strategy benefits from the defocus blur estimation, which is ignored in DirtyGAN.

4.4 Ablation studies

Model IS LPIPS CIS
none 1.21 0.50 1.03
Gaussian 1.35 0.51 1.13
Refract 1.46 0.50 1.12
Ours 1.53 0.52 1.15
(a) Model complexity
(b) Disentanglement Guidance
Figure 15: Ablations of model complexity and Disentanglement Guidance. In (a), we quantify disentanglement effects with simpler models having less variability (Refract) or only color guidance (Gaussian). Even if complexity is beneficial for disentanglement (Ours), simple models permit disentanglement to some extent. In (b), we study the efficacy of the Disentanglement Guidance (DG) for different threshold values. With no guidance, our approach falls back to the baseline and entangles occlusions, while with strict guidance the translation lacks important features such as reflections and glares. With an appropriate threshold, we simultaneously avoid entanglement and preserve translation capabilities.

We now ablate our proposal. Since the GAN guidance cannot be ablated, we focus on the model-guided setting by increasing the genetic population size, altering the model complexity, changing the model, or removing the disentanglement guidance.

Non-differentiable genetic estimation.

We study the effectiveness of our genetic estimation by ablating the population size of our raindrop model on RobotCar porav2019can as in Sec. 4.2.3. We test our algorithm with population sizes 10/25/50/100, obtaining FID 157.44/153.32/151.21/149.09 and LPIPS 0.43/0.44/0.44/0.43. While we observe a clear increase in performance with larger populations, this comes with additional computation time, hence we used the lowest population size of 10 for all tests. Nevertheless, this opens the door to potential improvements in the full parametric estimation.
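A minimal sketch of such a population-based search (our simplified variant with averaging crossover and Gaussian mutation; in our method the fitness comes from the discriminator response, here abstracted as any black-box score to minimize):

```python
import random

def genetic_search(fitness, bounds, pop_size=10, generations=60,
                   mutation=0.1, seed=0):
    """Estimate non-differentiable parameters by evolving a population of
    candidate parameter vectors; lower fitness is better."""
    rng = random.Random(seed)
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness)
        elite = pop[: max(2, pop_size // 5)]              # keep best candidates
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = rng.sample(elite, 2)
            child = [(x + y) / 2 for x, y in zip(a, b)]   # crossover: average
            for i, (lo, hi) in enumerate(bounds):         # gaussian mutation
                child[i] = min(hi, max(lo, child[i] + rng.gauss(0, mutation * (hi - lo))))
            children.append(child)
        pop = elite + children
    return min(pop, key=fitness)
```

Larger `pop_size` explores the search space better, matching the FID improvements observed above at the cost of proportionally more fitness evaluations.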

Model complexity.

We study the influence of the model on disentanglement for the raindrop task on nuScenes caesar2019nuscenes. Specifically, we evaluate three raindrop models of decreasing complexity: 1) our model from Sec. 4.2.1 (named Ours); 2) the same model but without shape and thickness variability (Refract); and 3) a naive non-parametric colored Gaussian-shape model (Gaussian). Note that Gaussian is deprived of any refractive property, as it uses a fixed color and does not regress any physical parameter.

In Fig. 15(a), we report GAN metrics for the translations of all models following Sec. 4.2. Even if increasing the complexity of the model is beneficial for disentanglement, very simple models still lead to a performance boost. We attribute the best performance of Ours to a more effective discriminator fooling during training, as a consequence of increased realism.

Model choice.

To evaluate whether the injected features behave only as adversarial noise regardless of the chosen model, we trained on RobotCar porav2019can (as in Sec. 4.2.3) while purposely using an incorrect model (watermark, dirt, fence). Evaluating the FID against real raindrop images, the correct raindrop model obtains the best score by a clear margin, proving the necessity of using the ad-hoc model.

Disentanglement Guidance (DG).

We use the nuScenes task to visualize the effects of different DG strategies (Sec. 3.3.3). For varying values of the DG threshold in Fig. 15(b), we see results ranging from no guidance to strict guidance. With lax guidance, we fall back to the baseline scenario with visible entanglement effects, while with strict guidance we do achieve disentanglement, but at the cost of losing important visual features such as reflections on the road. Only an appropriate guidance threshold achieves disentanglement while preserving realism.
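A hypothetical sketch of this spatial gating. The per-pixel domain-shift map and all names here are our illustrative assumptions; the actual DG of Sec. 3.3.3 derives its guidance signal differently:

```python
import numpy as np

def guided_injection(translated, occlusion_layer, alpha, shift_map, threshold):
    """Inject the physical occlusion model only where an (assumed) per-pixel
    domain-shift map is below a threshold, preserving gradients in regions
    carrying important translation features (reflections, glares)."""
    mask = shift_map < threshold
    if translated.ndim == 3:
        mask = mask[..., None]
    blended = alpha * occlusion_layer + (1.0 - alpha) * translated
    return np.where(mask, blended, translated)
```

A very low threshold never injects the model (baseline behavior, entanglement risk), while a very high one injects it everywhere, potentially suppressing translation features.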

5 Discussion

To the best of our knowledge, we have designed the first unsupervised strategy to disentangle physics-based features in i2i translation. The good qualitative and quantitative performances showcase promising interest for several applications; still, there are peculiar points and limitations which we now discuss.

Independence assumption.

For unsupervised disentanglement, we assume the physical model to be completely independent from the scene, in order to use our intuition about marginal separation (see Sec. 3.1 and Eq. 2). However, since physical models may need the underlying scene to correctly render the desired traits, one may argue their appearance is not completely disentangled. While this is true from a visual point of view, it is not from a physical one. Let us interpret disentanglement properties as dependent on scene elements. In the presence of disentanglement, the same physical model could be applied to different objects regardless of what they are. For instance, we could use the same raindrop refraction map on either roads or buildings with identical parameters. In this sense, dependency in physical models does not impact our visual independence assumption.

On partial entanglement issues.

We observe in some cases that gradient propagation can be affected by fixed entanglement of occlusion features. This is the case, for example, for sky regions in fog (Sec. 4.2.2), because physics narasimhan2002vision formalizes that, regardless of its intensity, fog always occludes regions at far distances. In such scenarios, disentanglement performs poorly because the generator does not get any discriminative feedback. In many other cases, however, Disentanglement Guidance (DG, Sec. 3.3.3) mitigates the phenomenon, as it blocks the injection of the physical model in relevant image regions. We conjecture that the effectiveness could be extended by varying DG at training time to ensure balanced gradient propagation.

On genetic estimation effectiveness.

The sub-optimal performance of our genetic estimation of the non-differentiable parameters is imputed to the much more complex search space, in which we vary all parameters of our physical model simultaneously. Although we set fairly large search limits, one could envisage a mixed training in which the search space is limited to reasonable hand-tuned bounds. In this sense, the genetic estimation could be seen as a minimum mining technique, ensuring increased performance over the hand-tuned values.