OneGAN: Simultaneous Unsupervised Learning of Conditional Image Generation, Foreground Segmentation, and Fine-Grained Clustering

12/31/2019 ∙ by Yaniv Benny, et al. ∙ Tel Aviv University

We present a method for simultaneously learning, in an unsupervised manner, (i) a conditional image generator, (ii) foreground extraction and segmentation, (iii) clustering into a two-level class hierarchy, and (iv) object removal and background completion, all done without any use of annotation. The method combines a generative adversarial network and a variational autoencoder, with multiple encoders, generators and discriminators, and benefits from solving all tasks at once. The input to the training scheme is a varied collection of unlabeled images from the same domain, as well as a set of background images without a foreground object. In addition, the image generator can mix the background from one image with a foreground that is conditioned either on that of a second image or on the index of a desired cluster. The method obtains state-of-the-art results when compared with the best existing methods for each of the tasks.




1 Introduction

We hypothesize that solving multiple unsupervised tasks together enables one to improve on the performance of the best methods that solve each task individually. The underlying motivation is that in unsupervised learning, the structure of the data is a key source of knowledge, and each task exposes a different aspect of it. We advocate solving the various tasks in phases, where easier tasks are addressed first and the other tasks are introduced gradually, while constantly updating the solutions of the previously introduced tasks.

The method consists of multiple networks that are trained end-to-end and side-by-side to solve multiple tasks. The method starts from learning background image synthesis and image generation of objects from a particular domain. It then advances to more complex tasks, such as clustering, semantic segmentation and object removal. Finally, we show the model’s ability to perform image to image translation. The entire learning process is unsupervised, meaning that no annotated information is used. In particular, the method does not employ class labels, segmentation masks, bounding boxes, etc. However, it does require a separate set of clean background images, which are easy to obtain in many cases.


Beyond the conceptual novelty of a method that single-handedly treats multiple unsupervised tasks, which were previously solved by individual methods, the method displays a host of technical novelties, including: (i) a novel architecture that supports multiple paths addressing multiple tasks, (ii) bypass paths that allow a smooth transition between autoencoding and generation based on a random seed, (iii) a multiplicity of discriminators, each dedicated to a specific path, (iv) the alternating paths, which backpropagate three tasks in each batch, (v) the dual mixup, which employs two random numbers in order to interpolate between latent representations on the one hand and between reconstructed and generated images on the other, (vi) an adversarial perceptual loss, (vii) GLU layer-normalization (appendix), and more. Thanks to these novelties, as demonstrated in the ablation studies, we obtain state of the art results compared to the literature methods in each of the individual tasks.

2 Related work

Since our work touches on many tasks, we focus the literature review on general concepts and on the most relevant work. Generative models are typically based on Generative Adversarial Networks [9] or Variational Auto-Encoders [16]. In addition, these two can be combined [18].

Conditional image generation  In conditional image generation, the generated image is conditioned on an initial variable, most commonly the target class. CGAN [20] and InfoGAN [5] proposed different methods to apply the condition on the discriminator. Our work is more similar to InfoGAN: since we do not use labeled data, the label is not linked to any real image and no conditional discriminator can be applied. The condition is maintained by a classifier that tries to predict the conditioned label and, as a result, forces the generator to condition the result on that label.

Semantic segmentation  In semantic segmentation, the task is to classify the image pixels based on their class labels. For the supervised setting, Unet [21], DeepLab [2], DeepLabV3 [3], and HRNet [26] have shown great performance leaps using a regression loss. For the unsupervised case, more creative solutions are considered. In [4, 13, 28, 27, 6, 24, 11], a variety of methods have been used, including inpainting, learning feature representations, clustering, or video frame comparison.

Clustering  In clustering, deep learning methods are the current state of the art. JULE [29] and DEPICT [8] cluster based on a learned feature representation. IIC [11] trains a classifier directly.

The most similar approach to ours is FineGAN [23], which our generators and discriminators are based upon. However, there are many significant differences and additions: (i) We added a set of encoders which are trained to support new tasks. (ii) While FineGAN employs one-hot input, our generators use coded input, which is important for our autoencoding path. (iii) We added a skip connection, followed by a mixup module that combines the bypass tensor with the pre-image tensor. The mixup also allows passing only one of the tensors, making either the bypass or the pre-image optional in each flow. (iv) We employ a single foreground representation instead of FineGAN’s double hierarchical representation. (v) Our model uses layer normalization instead of batch normalization, which performs better for a large number of classes, a small batch size, and alternating paths. (vi) We define a new normalization method for the generators, where GLU [7] activation layers are used as non-linear activations. (vii) We add many losses, regularization terms, and training techniques that were not used in FineGAN, many of which are completely novel, as far as we can ascertain. As a result, our work outperforms FineGAN in all tasks and is capable of performing new tasks that its predecessor could not handle.

Mixup [30] is a technique for applying a weighted sum between two or more latent variables in order to synthesize a new latent variable. We use it to mix latent variables that are part encoded by the encoder and part extracted from the one-hot priors. As far as we can ascertain, this is the first usage of mixup to merge information from different paths. Our mixup is applied during autoencoding in two different stages, hence the term dual-mixup.

3 Method

To solve the tasks of clustering, foreground segmentation, and conditional generation with minimal supervision, our method trains multiple neural networks side-by-side. The sub-networks specialize in different sub-tasks, based on their architecture, relative position to the other networks, and a set of suitable losses. Each task is solved by applying the networks in a specific task-dependent path (Sec. 3.1). In order to simplify training, instead of training the compound network to solve all three tasks at once, we schedule the training process by phases, see Sec. 4. The phases are designed to train the network for a gradually increasing subset of tasks, starting from image-level tasks (generating images) and moving to semantic tasks (semantic segmentation of the foreground, and semantic clustering) that benefit from the capabilities obtained in the first phase.

The architecture of the compound model is depicted in Fig. 1 and Fig. 2, and the layers of the sub-networks are detailed in the appendix. Our solution consists of two generators, three encoders and two discriminators.

Figure 1: Flow of the generation path. (a) The generators decode the four priors and produce three separate images (foreground, background, mask) that are combined into the final image. (b) The generated image is encoded by three encoders to retrieve the latent variables and priors.
Figure 2: Flow of the reconstruction path introduced in Phase II of the training. The exact same sub-networks from Phase I are rearranged to produce an autoencoding path. The image is first encoded with the shape and style encoders, which produce latent codes for the foreground generator. The foreground generator then decodes the codes to produce the foreground image and mask. With the mask, the background encoder encodes the image and the background generator decodes it to generate a background image. The image is finally reconstructed from the combined images. Between the encoders and the generators, the predicted shape and style codes, and those produced by the LUT, are merged with a mixup module. Furthermore, one bypass tensor replaces its pre-image outright, while the other bypass is merged with its pre-image tensor; these skip connections improve the background mask generation and object shape representation.

Generators  The two generators run in parallel to produce an output image, where each generator captures a different aspect of the generated image. The generators are conditioned on a two-level hierarchical categorization. Each category has a unique child class and a parent class shared by multiple child classes. These classes are represented by one-hot vectors. An additional background one-hot vector affects the generation of the background images. Since there is a tight coupling between the class of the object (water bird, tropical bird, etc.) and the expected background, the typology of the background follows the coarse hierarchy, i.e. the parent class.

A fourth vector is sampled from a multivariate Gaussian distribution to represent non-categorical features.

The background generator receives the background vector and the latent vector and produces a background image. The foreground generator receives the parent vector, the child vector, and the same latent vector used in the background generation, and produces a foreground image and a foreground mask. The generator is optimized such that all foreground images with the same parent vector have the same object shape and all images with the same child vector have a similar object appearance. The latent vector is implicitly conditioned to represent all non-categorical information, such as pose, orientation, size, etc. It is used in both the background and foreground generation, so that the images produced by both networks merge into a coherent image.

The generators start by converting the one-hot vectors into code vectors, using a single fully connected layer, in other words, look up tables (LUT). Such an embedding is often used when working with categorical values.
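The equivalence between a fully connected layer applied to a one-hot vector and a table lookup can be checked with a minimal sketch. The function names and the toy weight matrix below are illustrative assumptions, not taken from the authors' code, and the layer is assumed to have no bias term.

```python
def one_hot(index, size):
    """Build a one-hot vector of length `size` with a 1.0 at `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def linear_no_bias(weights, x):
    """Bias-free fully connected layer: y[j] = sum_i x[i] * weights[i][j]."""
    out_dim = len(weights[0])
    return [sum(x[i] * weights[i][j] for i in range(len(x))) for j in range(out_dim)]


def lookup(weights, index):
    """Look-up table: return row `index` of the weight matrix."""
    return list(weights[index])


# Hypothetical embedding table: 4 classes, code dimension 3.
W = [
    [0.1, 0.2, 0.3],
    [1.0, -1.0, 0.5],
    [2.0, 0.0, -2.0],
    [0.7, 0.7, 0.7],
]

code_via_layer = linear_no_bias(W, one_hot(2, 4))
code_via_lut = lookup(W, 2)
```

Multiplying by a one-hot vector selects a single row of the weight matrix, which is why the two views coincide and why the embedding rows can be fine-tuned by back-propagation like any other weights.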


The generators are then applied to produce the background image, the foreground image, and the mixing mask. Each generator is a composition of sub-modules applied back to back.


During autoencoding, the interface of the generators changes, due to the bypasses and the mixup module: the bypass tensor is passed in place of the pre-image, and the mixed codes are passed in place of the codes obtained from the priors, as described in Eq. 16.

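The final generated image is the mask-weighted combination of the foreground and background images:

$x = m \odot x_{fg} + (1 - m) \odot x_{bg}$

where $\odot$ denotes an element-wise multiplication, $x_{fg}$ and $x_{bg}$ are the foreground and background images, and $m$ is the mixing mask.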
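A minimal sketch of this composition step follows, assuming the usual mask-based blending in which the mask weights the foreground and its complement weights the background. The function name and the nested-list image layout are illustrative assumptions; real images would be multi-channel tensors.

```python
def compose(foreground, background, mask):
    """Blend two single-channel images pixel by pixel.

    Each argument is a list of rows; mask values lie in [0, 1].
    A mask value of 1 keeps the foreground pixel, 0 keeps the background pixel.
    """
    return [
        [m * f + (1.0 - m) * b for f, b, m in zip(f_row, b_row, m_row)]
        for f_row, b_row, m_row in zip(foreground, background, mask)
    ]


# Toy 2x2 example: the mask selects the foreground in the left column only.
fg = [[1.0, 1.0], [1.0, 1.0]]
bg = [[0.0, 0.0], [0.0, 0.0]]
mk = [[1.0, 0.0], [1.0, 0.0]]
image = compose(fg, bg, mk)
```

With a soft mask (values strictly between 0 and 1), the same expression produces a smooth transition at object boundaries.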

Encoders  Unlike FineGAN, our framework requires the use of encoders. There are three matching encoders: the background encoder, the content encoder, and the style encoder. They run in semi-parallel to predict both the latent codes of an input image and the underlying one-hot vectors. All encoders are fed with the image as input, and the background encoder is also fed with the mask. Since there is no annotated mask, the mask is produced according to Eq. 6 by the foreground generator, when fed with the encoding produced by the content and style encoders.

The encoders output prediction probabilities for the parent and child classes, vectors defining the mean and variance from which each element of the latent codes is sampled from a Gaussian distribution, and the bypasses depicted in Fig. 2 as skip connections between encoders and generators.

Discriminators  Following [23], the two discriminators are adversarial opponents on the outputs of the generators. The background discriminator receives either a real background image from the set of background images, a fake background image generated by the background generator, or a real object image from the set of object images. This discriminator has three tasks, with a separate output for each: (i) a patch-wise prediction of whether the input image is real or fake, (ii) a patch-wise prediction of whether the input image is a background image or not, and (iii) extraction of a hidden-layer output for the perceptual loss.

The image discriminator receives real images from the set of object images or generated fake images, and also has three tasks: (i) predict whether the input image is real or fake, (ii) predict the child class of the image, and (iii) extract a hidden-layer output for the perceptual loss. The classification task is not an adversarial one, and the foreground generator cooperates with that task instead of aiming to fool it. The first and third tasks lead the generators to generate real-looking images, and the second task encourages the generators to condition the generation on the child class.

During training, when entering Phase II, the fake images for both discriminators can be a result of either (i) the generation path, (ii) fake image autoencoding, or (iii) real image autoencoding. These paths can generate very different images, and the discriminators might not be able to optimize them simultaneously, since each path is practically a different domain. We noticed that images from the autoencoding path fail to converge to real-looking images when the discriminators are trained only on the generation path outputs. To solve this, upon entering Phase II, we clone each discriminator twice and associate one separate copy with each path, resulting in a total of three background discriminators and another three for the foreground. In this setting, each path receives an adversarial signal that is concentrated on improving the results in that particular path.
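The cloning step can be sketched as follows. This is only an illustration under stated assumptions: the path names, the `clone_per_path` helper, and the dictionaries standing in for trained networks are hypothetical, and the original Phase I discriminator is assumed to keep serving the generation path.

```python
import copy

# Hypothetical names for the three paths that produce fake images in Phase II.
PATHS = ("generation", "fake_autoencoding", "real_autoencoding")


def clone_per_path(discriminator):
    """Return one independent copy of `discriminator` per training path.

    The original object is reused for the generation path; the two
    autoencoding paths receive deep copies that start from the same weights.
    """
    clones = {PATHS[0]: discriminator}
    for path in PATHS[1:]:
        clones[path] = copy.deepcopy(discriminator)
    return clones


# Stand-ins for trained networks: any deep-copyable object works here.
background_d = {"weights": [0.1, -0.2, 0.3]}
image_d = {"weights": [0.5, 0.4]}

discriminators = {
    "background": clone_per_path(background_d),
    "image": clone_per_path(image_d),
}

# Updating one clone leaves the others untouched.
discriminators["image"]["real_autoencoding"]["weights"][0] = 9.9
```

Starting every clone from the Phase I weights means each path inherits what was already learned, while later updates specialize each copy to its own image domain.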

3.1 Inference tasks

For image generation, the input is a set of prior vectors. The path is described in Fig. 1, Eq. 2-7.

For image autoencoding, the input is an image. The path is described in Fig. 2. The precise flow is: (1) encode the foreground through the content and style encoders, Eq. 8,9, (2) generate a foreground image and mask with the foreground generator, Eq. 6, (3) encode the image and mask through the background encoder, Eq. 10, (4) generate the background image with the background generator, Eq. 4, and (5) compose the final image from the foreground image, the background image and the mask, Eq. 7. The background encoding is delayed, because it relies on a segmentation mask which has to be generated first.

For segmentation, the input is an image. The path is the same as steps (1-2) in the autoencoding task. The segmentation mask is the produced foreground mask.

For clustering, the input is an image. The clusters are acquired by applying the content and style encoders and using the prediction of the parent and child classes for hierarchical clustering, or only the child class for non-hierarchical clustering. The clusters can be determined by the predicted one-hot vectors, but better results, and the ones shown in the evaluation section, were reached when applying the K-means algorithm on the concatenated code vectors.
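The K-means step on concatenated code vectors could look roughly like the sketch below. It is a plain Lloyd's-algorithm implementation written for illustration; the helper names, the toy codes, and the seeded random initialization are assumptions, and the authors may well have used a library implementation with different settings.

```python
import random


def concat_codes(parent_code, child_code):
    """Concatenate the two encoder code vectors into one feature vector."""
    return list(parent_code) + list(child_code)


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20, seed=0):
    """Lloyd's algorithm: returns (labels, centers)."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest center.
        labels = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


# Hypothetical encoder outputs for four images (2-d parent code, 2-d child code).
features = [
    concat_codes([0.0, 0.0], [0.0, 0.0]),
    concat_codes([0.1, 0.0], [0.0, 0.1]),
    concat_codes([5.0, 5.0], [5.0, 5.0]),
    concat_codes([5.1, 5.0], [5.0, 4.9]),
]
labels, centers = kmeans(features, k=2)
```

Clustering in the continuous code space keeps information that is discarded when only the arg-max of the predicted class probabilities is used, which is one plausible reading of why it gave better results.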


4 Multi-phase training

Without a multi-phase training, the networks would train to generate fake images and autoencode real and fake images simultaneously. While the generation flow encourages a separation between the background and foreground components, the autoencoding flow resists this separation due to the trivial solution of encoding and decoding the image in one of the paths (foreground or background) and applying an all-zero or all-one mask.

The multi-phase training has two phases. In the first phase, we train the generators to synthesize images conditioned on latent variables, while considering the background and the foreground separately. In this controlled environment, the generators are much more likely to converge to the required setting, and in fewer training iterations. After a number of iterations fixed in advance by a hyperparameter, the second phase begins, where (i) the model is trained to also reconstruct images through both latent variables and bypass information from the encoders, and (ii) the information from the encoders and the information from the latent vectors are merged with the mixup module at two separate locations independently.

4.1 Phase I of training

In the first phase of training, the model learns to generate images in a way that relies on generating a background image, a foreground image, and a foreground mask, as in Fig. 3. The discriminators are trained along with the generators and produce an adversarial training signal. The encoders are also trained in this phase, but only to retrieve the latent variables from the generated images.

The losses in Phase I can be put into four groups: adversarial losses, classification losses, distance losses, and regularizations. For brevity, shorthand notation is used for the dependence on the three prior codes and for the full generation of the final image, Eq. 3-7.

Adversarial losses  These involve the two discriminators and are derived from the minimax equation: $\min_G \max_D \; \mathbb{E}_x[\log D(x)] + \mathbb{E}_z[\log(1 - D(G(z)))]$, for a generic generator $G$ and discriminator $D$. The concrete GAN loss is the sum of the losses for the separation between real/fake background, the separation between background/object, and the separation between real/fake object.

For the generators, the losses are:


For the discriminators, which are trained on the sets of real background images and real object images, the losses are:


Classification losses  These losses optimize the generators to generate distinguishable images for each style and shape prior, and the encoders to retrieve the prior classes.


Here, CE is the cross-entropy loss. It is applied to the output of the conditional task of the discriminator described in Sec. 3 and to the class predictions of the encoders from Eq. 8,9; the targets are the child and parent classes the image was generated with, Eq. 1.

Distance losses  We train the encoders to minimize the mean squared error between the vectors in the latent space produced during generation and their predicted counterparts. These vectors are used in the next phase to feed the generator in the autoencoding path, making it beneficial to add this constraint already at this phase.


Here, the compared quantities are the "pre-image" and "bypass" tensors, computed by Eq. 3,5,8,10, and the mean vectors from which the latent codes are sampled.

Regularization losses  For regularization, one loss term is applied on the latent codes and another on the foreground mask. They are detailed in the appendix.

All the losses are summed together to form the total Phase I loss.


4.2 Phase II of training

In the second phase of the training, an additional task is added, where the network has to utilize the encoders and generators to reconstruct an input image, as in Fig. 4. In each batch iteration, two new reconstruction paths are run, where the input image can be either a real image or a fake image generated from a latent code in advance. To smooth the transition from pure image generation to the multi-task setting, we begin by reconstructing only fake images, which the model is already familiar with, and delay the real image reconstruction by a number of iterations set with a hyperparameter. We have also found that training only the encoders and keeping the generators fixed in the first iterations of this phase assists in this transition.

This phase is also when the mixup is introduced. The application of mixup in our setting serves two purposes in two separate locations inside the compound network, hence the term dual mixup.

The first mixup (mixup0 in Fig. 2) mixes the vectors provided by the encoders with the embeddings given by the LUT for the one-hot vectors. When fine-tuning the embeddings towards real image classes, we want to reduce the distance between the two matching vectors. Therefore, we merge the two types of vectors with the mixup module, such that both are part of the reconstruction and both are optimized during back-propagation.

The second mixup (mixup1 in Fig. 2) mixes the bypass with the pre-image. It serves to create the condition where the reconstruction path is simultaneously dependent on the foreground bypass and on the lower stage of the foreground generator. In contrast to regular residual connections, the ever-changing mixing coefficient used in the mixup forces the bypass and the pre-image to be independent representations that do not complement each other. This way, at any time we can choose any coefficient, or even pass only the bypass or only the pre-image, and obtain an almost identical image.

Given two inputs and a mixing parameter between 0 and 1, the mixup returns the weighted sum of the two inputs, with the weights determined by the parameter. The mixing parameters are randomly sampled in each iteration for each instance in the batch. When mixup is applied, the mixed features replace the pre-mixup features as inputs to the generator, see Eq. 5,6.
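A minimal sketch of such a mixup module is given below. The convention that the parameter weights the second input, the uniform sampling, and the function names are assumptions made for illustration; the paper's exact definition and sampling distribution may differ.

```python
import random


def mixup(a, b, beta):
    """Convex combination of two equally sized vectors.

    beta = 0 returns `a` unchanged, beta = 1 returns `b` unchanged.
    """
    return [(1.0 - beta) * x + beta * y for x, y in zip(a, b)]


def mixup_batch(batch_a, batch_b, rng):
    """Mix two batches, sampling an independent beta for every instance."""
    betas = [rng.random() for _ in batch_a]
    mixed = [mixup(a, b, beta) for a, b, beta in zip(batch_a, batch_b, betas)]
    return mixed, betas


# Hypothetical encoder codes and LUT codes for a batch of two instances.
encoded = [[1.0, 3.0], [0.0, 0.0]]
from_lut = [[3.0, 5.0], [2.0, 2.0]]
mixed, betas = mixup_batch(encoded, from_lut, random.Random(0))
```

Because a fresh parameter is drawn for every instance, neither input can rely on the other to fill in missing information, which matches the stated goal of keeping the two representations interchangeable.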

The losses in this phase can be put in three groups: statistical losses, reconstruction losses, and perceptual losses.

Statistical losses  As in Variational Auto-Encoders [16], we compute the Kullback-Leibler divergence between the distribution of the latent variables produced by the encoders and a multivariate Gaussian distribution. For the pose vector, we use the standard normal distribution, with a zero mean vector and a covariance matrix equal to the identity matrix. For the shape and style vectors we still use the identity covariance, but since they should match their latent codes, we use these latent codes as the target mean.


Reconstruction losses  The reconstruction losses are a set of L1 losses that measure the difference between the input image and the output. The network is trained to autoencode both real and fake images. For fake images, we compare the reconstruction of the full image, the background image, and the mask. For real images, the only ground truth is the full image, so the loss only compares that.

Perceptual losses  Comparing images to their ground-truth counterpart has been shown to produce blurred images. Perceptual loss [12] is known to aid in producing sharper images with more visible context [32]. The perceptual loss is often used along with a pre-trained network, but this relies on ImageNet-scale added supervision. In our case, we use the discriminators, which were pre-trained in the first phase. We use the notation from Sec. 3 to describe the networks used to extract the hidden layers before the output of the discriminators.

All the losses are summed together to form the total Phase II loss.


5 Experiments

Figure 3: Image Generation for each dataset. From top to bottom: (i) final image, (ii) foreground, (iii) foreground mask, (iv) background.
Figure 4: Image reconstruction for each dataset. From top to bottom: (i) real image, (ii) reconstructed image, (iii) reconstructed foreground, (iv) reconstructed background, (v) ground-truth foreground mask, (vi) predicted foreground mask.
Figure 5: Conditional Generation. From left to right: (i) real images, (ii-vi) generation of images with the encoded parent and child codes and a different vector per column, (vii) FineGAN [23] results, (viii) StackGANv2 [31] results.
Figure 6: Style Transfer. From left to right: (i) real images. (ii-vi) reconstructed images when the child code is switched with a code from a selected category, (vii) FineGAN [23] results, (viii) StackGANv2 [31] results.
Model             Birds           Dogs            Cars
                  C-IS    NMI     C-IS    NMI     C-IS    NMI
Dataset           47.96   .738    77.13   .879    55.47   .626
StackGANv2        15.06   .171    10.24   .137    13.39   .167
FineGAN           24.75   .454    15.71   .403    13.67   .228
OneGAN            30.73   .541    19.66   .516    18.24   .275
w/o real recon    25.67   .499    17.03   .463    15.54   .288
Phase I only      21.65   .488    16.86   .443    13.42   .245
no multi-phase     2.30   .101     1.79   .158     3.93   .092
Table 1: Quantitative generation results (C-IS and NMI for each dataset).
Model             Birds                 Dogs
                  (i)    (ii)   (iii)   (i)    (ii)   (iii)
StackGANv2        6.3    1.4    5.7     3.5    1.5    5.2
FineGAN           6.0    5.0    7.0     5.8    4.4    8.2
OneGAN (ours)     7.3    6.9    8.7     6.8    6.4    8.3
Table 2: User study results. Average score between 1 and 10 for each question asked in the study: (i) overall quality, (ii) conditional resemblance, (iii) pose disentanglement.
Model             Birds           Dogs            Cars
                  IOU    DICE     IOU    DICE     IOU    DICE
ReDO              46.5   60.2     38.4   52.8     16.2   26.2
UISB              44.2   60.1     62.7   75.5     64.7   77.5
IIC stuff-3       36.5   70.2     58.5   71.5     58.5   71.5
IIC stuff         35.2   50.4     56.6   70.2     58.8   71.7
OneGAN            55.5   69.2     71.0   81.7     69.7   81.0
w/o real recon    53.5   67.7     67.1   78.6     69.8   81.1
Phase I only      45.7   60.6     65.1   77.3     64.8   75.9
no multi-phase    28.2   43.2      7.4   13.6     45.9   60.5
Table 3: Segmentation results (IOU and DICE for each dataset). The UISB and IIC rows are unfair upper-bound results, obtained by selecting the best result out of many.
Model             Birds           Dogs            Cars
                  NMI    AMI      NMI    AMI      NMI    AMI
JULE              .204   -        .142   -        .232   -
DEPICT            .290   -        .182   -        .329   -
IIC               .345   .106     .200   .127     .254   .056
StackGANv2        .253   .073     .139   .075     .174   .025
FineGAN           .349   .152     .194   .122     .233   .055
OneGAN (ours)     .391   .173     .202   .129     .266   .071
w/o real recon    .389   .171     .194   .121     .250   .066
Phase I only      .352   .151     .175   .100     .244   .063
no multi-phase    .216   .024     .082   .021     .208   .013
Table 4: Clustering results (NMI and AMI for each dataset). The JULE and DEPICT results are provided by [23] without AMI.

We train the network for 600,000 iterations, with batch size 20. All sub-networks are optimized using Adam [15], with lr=2e-4 and default arguments. Phase I duration is 200,000 iterations. Within Phase II, we start with training only on fake images and real image reconstruction starts after another 200,000 iterations.
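The schedule described above can be written down as a small helper. The iteration counts come from the text; the path names, the zero-based iteration index, and the function itself are illustrative assumptions, and the initial encoder-only warm-up inside Phase II is not modeled because its length is not stated.

```python
TOTAL_ITERS = 600_000        # full training length
PHASE_I_ITERS = 200_000      # generation-only phase
REAL_RECON_DELAY = 200_000   # Phase II iterations before real images are reconstructed


def active_paths(iteration):
    """Return the training paths that are active at a zero-based iteration."""
    if not 0 <= iteration < TOTAL_ITERS:
        raise ValueError("iteration outside the training schedule")
    paths = ["generation"]
    if iteration >= PHASE_I_ITERS:
        paths.append("fake_reconstruction")
    if iteration >= PHASE_I_ITERS + REAL_RECON_DELAY:
        paths.append("real_reconstruction")
    return paths


# Number of iterations during which all three paths are trained together.
full_multitask_iters = TOTAL_ITERS - PHASE_I_ITERS - REAL_RECON_DELAY
```

Under these numbers the training splits into three equal stretches of 200,000 iterations: generation only, generation with fake-image reconstruction, and all three paths.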

We evaluate our model on various tasks against the state of the art methods. Since no other model can solve all these tasks, we evaluate against different methods in each task. Depending on availability, some baselines were pre-trained models released by the authors and some were trained from scratch with the authors’ code.

Datasets  We evaluate our model with three datasets of fine-grained categorization. Caltech-UCSD Birds-200-2011 (Birds) [25]: This dataset consists of 11,788 images of 200 classes of birds, annotated with bounding boxes and segmentation masks. Stanford Dogs (Dogs) [17]: This dataset consists of 20,580 images of 120 classes of dogs, annotated with bounding boxes. For evaluation, target segmentation masks were generated by a pre-trained DeepLabV3 [3] model on the COCO [19] dataset. The pre-trained model was acquired from the gluoncv toolkit [10]. Stanford Cars (Cars) [14]: This dataset consists of 16,185 images of 196 classes of cars, annotated with bounding boxes. Segmentation masks were generated as above with the pre-trained DeepLabV3 model.

For all datasets, to produce the object and background subsets, we split the dataset in an 80/20 ratio. We take the bigger subset as the set of object images. From the smaller subset, we use the bounding boxes to cut out background patches to use as background images. No image was used for both foreground and background examples and, of course, the bounding boxes were not used to train our method.

Due to the different number of classes in each dataset, the numbers of child and parent classes in the design also differ between Birds, Dogs, and Cars.

Conditional image generation  We compare our image generation results to FineGAN [23] and StackGANv2 [31], by relying on an InceptionV3 network fine-tuned on each dataset. The normalized mutual information (NMI) between the child classes and the classes predicted by the Inception model is employed. In addition, we compute a conditional variation (C-IS) of the Inception Score [22] (IS). For each child class, we average the predictions of the Inception model, and compute the IS on the class averages. Following the same principle as the original IS evaluation, a good model should have a high probability on a single real class for each conditioned child class, and a uniform distribution across all classes when averaged across all child classes.
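One way to read this description is sketched below: the Inception predictions are first averaged per child class, and the usual Inception Score formula, the exponential of the mean KL divergence to the marginal, is then applied to these class averages. This reading, the function names, and the toy predictions are assumptions; the authors' exact implementation may differ.

```python
import math


def mean_distribution(distributions):
    """Element-wise mean of a list of probability vectors."""
    count = len(distributions)
    return [sum(column) / count for column in zip(*distributions)]


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def conditional_inception_score(predictions_by_child):
    """C-IS sketch: Inception Score computed on per-child-class averages.

    `predictions_by_child` maps a child class index to the list of
    classifier probability vectors of the images generated for that class.
    """
    class_means = [mean_distribution(p) for p in predictions_by_child.values()]
    marginal = mean_distribution(class_means)
    mean_kl = sum(kl_divergence(p, marginal) for p in class_means) / len(class_means)
    return math.exp(mean_kl)


# Toy case: two child classes, each mapped confidently to a different real class.
sharp = {0: [[1.0, 0.0], [1.0, 0.0]], 1: [[0.0, 1.0], [0.0, 1.0]]}
# Toy case: the condition has no effect on the predicted class.
collapsed = {0: [[0.5, 0.5]], 1: [[0.5, 0.5]]}
score_sharp = conditional_inception_score(sharp)
score_collapsed = conditional_inception_score(collapsed)
```

Under this reading the score is bounded above by the number of classes, and it reaches its minimum of 1 when the conditioning has no influence on the predicted class.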

Our results, reported in Tab. 1, show that OneGAN outperforms the baselines on both conditional image generation metrics. StackGANv2 was the worst performing model, which suggests that the mask-based generation used by our model and by FineGAN is very useful for conditional generation. Our model also outperformed FineGAN, which, as our ablation demonstrates, is due to the reconstruction path added during training. We back these numerical results with Fig. 5.

We strengthen these results with a user study. In the study, we chose ten different real images. For each model (OneGAN, FineGAN, StackGANv2), we encoded the images to extract the child and parent codes of each image, and then produced ten images conditioned on the codes of each of the ten images (see supplementary for the images). We then asked the following questions: (i) How realistic do the images look? (ii) How much are the objects in the images related to the conditioned images? (iii) How well do images generated with the same latent vector share the same non-categorical properties (pose, size, location, etc.)? As the results presented in Tab. 2 show, our model generates conditioned images that are of better quality and represent a more coherent class. Our model also disentangles the categorical and non-categorical features better than the other models.

Unsupervised foreground segmentation  We compare the mask predicted by the autoencoding pathway to the real foreground mask and evaluate according to IOU and DICE scores. We compare against two baselines, ReDO [4] and UISB [13], which are trained for each dataset separately, and a third one, IIC [11], which was trained on coco-stuff and coco-stuff-3 (a subset). While coco-stuff is a different dataset from the ones we used, it contains all the relevant classes. ReDO produces a foreground mask, which we compare to the ground truth in the same way as we evaluate our model. UISB is an iterative method that produces a final segmentation with a varying number of classes between 2 and 100. We iterated UISB on each image 50 times; the output usually had between 4 and 20 classes. Since the foreground and background classes are not labeled, this method cannot be used directly for this task. To obtain an evaluation, we look, for each image, for the class that has the highest IOU with the ground-truth foreground and merge the rest of the classes into a single background class. We then repeat with a single background class and the rest merged into the foreground. Finally, we take the better of the two options. Since each option is obtained by using an oracle to select out of many candidates, this provides a liberal upper bound on the performance of UISB. IIC also produces a multi-class segmentation map, and we use it in the same way as UISB, taking the best class for either background or foreground with respect to IOU. IIC has a two-headed output, one head for the main task and one for over-clustering. For coco-stuff trained IIC, we look for the best mask in one of the 15 classes of the main head. For coco-stuff-3 trained IIC, the main head is trained to cluster sky/ground/plants, so we look for the best mask in one of the 15 classes of the over-clustering head. The results in Tab. 3 show that our method outperforms all the baselines. The generated masks are presented in Fig. 4.
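The two scores and the oracle selection used for the multi-class baselines can be sketched as follows. Masks are flattened lists of booleans, and `best_binary_mask` is a hypothetical helper that tries every class both as the sole foreground and as the sole background; this is an interpretation of the procedure described above, not the authors' evaluation code.

```python
def iou(pred, target):
    """Intersection over union of two flattened binary masks."""
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    union = sum(1 for p, t in zip(pred, target) if p or t)
    return intersection / union if union else 1.0


def dice(pred, target):
    """DICE score: 2 * |A and B| / (|A| + |B|)."""
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    return 2.0 * intersection / total if total else 1.0


def best_binary_mask(label_map, target):
    """Oracle reduction of a multi-class label map to a foreground mask.

    Each class is tried once as the only foreground class and once as the
    only background class; the candidate with the highest IOU is returned.
    """
    candidates = []
    for cls in sorted(set(label_map)):
        candidates.append([label == cls for label in label_map])  # cls is foreground
        candidates.append([label != cls for label in label_map])  # cls is background
    return max(candidates, key=lambda mask: iou(mask, target))


# Toy example: three predicted classes, ground-truth foreground covers classes 1 and 2.
labels = [0, 0, 1, 1, 2, 2]
truth = [False, False, True, True, True, True]
chosen = best_binary_mask(labels, truth)
```

Because the ground truth is consulted when choosing the class, the resulting scores are an upper bound on what such a baseline could achieve without labels.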

Unsupervised clustering  We compare our model against encoders of our design trained alongside StackGANv2 and FineGAN. As in our model, we trained a child encoder and a parent encoder for both models during training. In this task, we evaluate how well the encoders are capable of clustering real images. We also compare our model against other clustering methods: JULE [29], DEPICT [8] and IIC [11].

The results show that our model outperforms the other models on both the Birds and Dogs datasets. For the Cars dataset, similarly to the generation results, our model, as well as the other models, grouped images more on the basis of color than of car model. This reduced the clustering performance, and our model was only second best.

Object removal and inpainting  Through the reconstruction task, our model is also capable of performing automatic object removal and background reconstruction, see Fig. 4. In contrast to other known methods for inpainting, and due to the lack of a perfect ground-truth mask, our model does not only fill in the missing pixels but fully reconstructs the background image. As a result, the background image is not identical to the original background, but it is semantically similar to it. See the supplementary for a comparison with previous work.

Image to image translation  To further evaluate our model, we show its capability to transfer an input image to a target category. The results can be seen in Fig. 6. Even though our model was never trained on this task, the disentanglement between the shape and the texture enables this task simply by passing a different child code during reconstruction. By selecting different child codes, we can manipulate the appearance of the object to any of the child categories. In contrast, FineGAN and StackGANv2 are unable to perform this task correctly as there is no learned disentanglement in StackGAN’s case and no bypass connection in FineGAN’s case to allow good reconstruction.

Ablation study  In Tab. 14 and 4, we provide multiple versions of our method for ablation. In the version without real reconstruction, we only add fake image reconstruction in Phase II, meaning that real images did not pass through the network during training. Another variant employs only the first phase of training. Finally, a third variant trains all losses at once, without multi-phase training. In addition, in the supplementary, we provide an ablation study supporting mixup, bypass connections, and various loss terms.

6 Conclusions

By building a single model to handle multiple unsupervised tasks at once, we demonstrate the power of co-training, surpassing the performance of the best-in-class methods for each task. This capability is enabled by a complex architecture with many sub-networks. Considering the biological visual system, one can expect future architectures to be complex and to contain multiple pathways between the various modules. However, supporting this complexity during training is challenging. We introduce a dual mixup loss that integrates multiple pathways in a homogenized manner, and a multi-phase training scheme, which helps to prevent some tasks from dominating the others.


This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant ERC CoG 725974).


Appendix A Summary of notation

For convenience, in Tab. 5 we provide a complete listing of the notation used in our paper.

Appendix B Regularization

Due to lack of space in the main text, we include the regularization loss terms as part of the supplementary.

During generation, we apply regularization on the latent vectors and on the mask image. The former serves to bound the range of the values to be close to the axis center and to be closely grouped.

The regularization on the mask serves to direct the model to utilize the mask efficiently, with a balanced and decisive representation of background and foreground. For a mask batch , with as the batch size and as the height and width of the mask, the first regularization term balances the mask value around the value of one half.

The second regularization term aims to make the masks more decisive. It is better described when the mask is in the range [-1,1], so we define . In the ideal case, all pixels are either 1 or -1 (either background or foreground). Therefore, if we assume a balanced distribution, for each mask the average value of is 0.5 and that of is -0.5, since half of the pixels are zeroed in each term. This is the decisiveness regularization.

Together, the mask regularization loss is:

Both regularizations are added to the rest of the generation losses in Eq. 15.
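The two mask regularizers described in this appendix can be sketched as follows. The exact loss formulas are not reproduced in this text, so the sketch rests on assumptions: the squared-error form of both penalties, the rescaling 2m - 1 from [0,1] to [-1,1], and the function name `mask_regularization` are all hypothetical choices made for illustration.

```python
def mask_regularization(masks):
    """Return (balance, decisiveness) penalties averaged over a batch.

    masks: list of masks, each a flat list of values in [0, 1].
    Assumed form: squared distance of each statistic from its ideal value.
    """
    balance = 0.0
    decisive = 0.0
    for m in masks:
        n = len(m)
        # Balance: the mean mask value should be close to one half.
        balance += (sum(m) / n - 0.5) ** 2
        # Decisiveness: rescale to [-1, 1] (assumed mapping 2m - 1).
        shifted = [2.0 * v - 1.0 for v in m]
        pos = sum(max(v, 0.0) for v in shifted) / n  # ideal value 0.5
        neg = sum(min(v, 0.0) for v in shifted) / n  # ideal value -0.5
        decisive += (pos - 0.5) ** 2 + (neg + 0.5) ** 2
    batch = len(masks)
    return balance / batch, decisive / batch
```

Under these assumptions, a balanced binary mask incurs no penalty, a mask that is 0.5 everywhere is balanced but maximally indecisive, and an all-one mask is penalized by both terms.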

Appendix C Sub-networks architecture

In this section, we describe the details of each sub-network described in Sec. 3 and shown in Fig. 2,2. The layers of each sub-network are listed in Tab. 7-13, with some frequently used modules listed in Tab. 6. The majority of the networks are sequential. When more complicated connections are present, the input and output notations are there to guide the flow.
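The spatial sizes implied by the layer names in the tables can be checked with the standard convolution output-size formula. The sketch below assumes that a name such as K4S2P1 denotes kernel size 4, stride 2 and padding 1, that each DOWNBlk halves the resolution through its K4S2P1 convolution, that the input images are 128x128, and that the background discriminator first resizes its input to 126x126. The function names are hypothetical.

```python
def conv_out(size, kernel, stride, padding):
    """Output size of a convolution along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def encoder_flat_features(image_size=128, channels=1024):
    """Flattened feature size after the convolutional stack of Tab. 9 / Tab. 13."""
    size = image_size
    for _ in range(5):  # first K4S2P1Conv2d plus four DOWNBlk modules
        size = conv_out(size, kernel=4, stride=2, padding=1)
    size = conv_out(size, kernel=3, stride=1, padding=1)  # K3P1Conv2d keeps the size
    return channels * size * size


def background_disc_grid(size=126):
    """Side length of the output grid of the background discriminator (Tab. 12)."""
    size = conv_out(size, kernel=4, stride=2, padding=0)
    size = conv_out(size, kernel=4, stride=2, padding=0)
    size = conv_out(size, kernel=4, stride=4, padding=0)
    return size
```

With these assumptions, the encoder stack reaches a 4x4 resolution, so Reshape(16384) corresponds to 1024 channels (1024 x 4 x 4 = 16384), and the background discriminator produces a 7x7 grid of outputs rather than a single score.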

c.1 GLU layer-normalization

Due to training with a small batch size, a large number of classes, and the use of different paths, batch-normalization did not have a positive effect in our method. We experimented with several alternatives: (i) layer-normalization, (ii) instance-normalization, (iii) no normalization, and (iv) a combination of different normalization layers for each sub-network. We came to the conclusion that layer-normalization performed best when used across all networks. Additionally, in the generators, where the Gated Linear Unit (GLU) is used as the non-linear activation, we realised that a more elaborate scheme needs to be applied. The GLU performs a non-linear activation by splitting its input in half along the channel axis, resulting in two equally sized tensors, each with half the number of channels of the input. One of the tensors goes through a sigmoid activation and is then multiplied by the other tensor to produce the output.

With this implementation, if the input tensor is normalized with a layer-normalization before the split, then the two halves affect each other, and since they serve different purposes, this has an unwanted effect. To solve this, we separate the two halves before the normalization, and apply the layer-normalization to only one of them.
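A minimal sketch of such a split-then-normalize GLU is given below, in plain Python over lists of channels. Two points are assumptions, since this text does not specify them: the half that is normalized is taken to be the non-gated (content) half, and the layer-normalization has no learnable scale or shift. The names `layer_norm` and `glu_norm` are hypothetical.

```python
import math


def layer_norm(channels, eps=1e-5):
    """Normalize all values of one sample jointly to zero mean and unit variance."""
    values = [v for ch in channels for v in ch]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    scale = 1.0 / math.sqrt(var + eps)
    return [[(v - mean) * scale for v in ch] for ch in channels]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def glu_norm(channels):
    """GLU in which only the content half is layer-normalized (assumed choice).

    channels: list of 2C channels, each a flat list of spatial values.
    Returns C channels.
    """
    half = len(channels) // 2
    content, gate = channels[:half], channels[half:]
    # Normalize after the split, so gate statistics do not affect the content.
    content = layer_norm(content)
    return [
        [c * sigmoid(g) for c, g in zip(content_ch, gate_ch)]
        for content_ch, gate_ch in zip(content, gate)
    ]
```

Because the split happens before the normalization, changing the scale of the gate half leaves the normalized content values untouched, which is the decoupling the paragraph above argues for.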

Appendix D A comprehensive ablation study

In the main paper, we address in Tab. 1,4,4 how different training methods affect performance. We showed that applying image reconstruction at Phase II greatly improves the results in all tasks. We also showed that applying not only fake image reconstruction, but also real image reconstruction, improves most of the results. Finally, we showed that the multi-phase training method is crucial, by showing that a model trained without multi-phase training performed very poorly and, in fact, did not manage to learn any task.

The conclusion from these tests was that the use of multi-phase training helps the model to start by learning the generation task and then learn the reconstruction tasks on top of that learned skill. The ablation also showed that using the reconstruction task helped not only in segmentation and clustering, but also in the generation task itself, which does not use an image as input and whose training starts before the reconstruction is introduced.

In this section, we provide more ablation studies and draw further conclusions. In addition to the ablation results reported in the main text, we also show performance in both image generation and image segmentation on the birds dataset, see Tab. 14. The supplementary ablation inspects the usage of mask regularization, bypass and mixup.

In our experiments, it is evident that mask regularization and limiting the mixup range both improve the segmentation performance. Further, mask regularization, bypass and mixup all contribute to the conditional generation performance.

Specifically, mask regularization helps by steering the model away from the degenerate state in which the generated masks are all-one or all-zero.

The bypass also contributes to both metrics. In the experiments we have seen that the reconstructed masks produced without bypass have an accurate coarse shape, but lack the fine details, therefore under-performing since the masks do not fit the object as well as with the bypass.

For the mixup module, we experimented with different ranges for . (1) Without mixup at all, i.e. , there is a full use of the bypass, but no use of the vectors from the LUT. This resulted in a high segmentation score, since the bypass was still fully used, but in a low conditional generation score, since the LUTs were not optimized during autoencoding. (2) With full mixup, i.e. , there is a larger variation in the proportion of to the mixed value , which turned out to be unstable, and the reconstruction task suffered.

Appendix E Additional results

e.1 Conditional generation

We supply more conditionally generated images to further demonstrate the conditional generation performance. We use generation conditioned on both very different and very similar classes, in order to show both the broad coverage and the high sensitivity to detail in the various clusters.

In Fig. 7, the images are obtained by generating five different images per each reference image in the top row. To achieve these results, our model has to perform two tasks. First, it has to be able to detect the child and parent classes under which the object is represented. Second, it needs to be able to generate a similar looking object with the predicted classes. The success in this task is evidence for both the generation and clustering capabilities of the model.

In the figure, each column shows a real image, followed by five generated images conditioned on the category of that real image. Additionally, in each row, all images are generated with the same pose code, showing how non-categorical information is consistent across the different categories and how the pose is disentangled from the shape category.

For both birds and dogs, we can see that the generated images have very similar properties to their conditioning image. In both cases, we can see the large coverage of different classes and the fine-detailed differences between similar classes. For cars, we can see that the generated images share the same color while the car shape changes, indicating that the model has categorized the cars by their color and not by their model.

e.2 Reconstruction

We supply more images to show the reconstruction path with the resulting reconstructed images, reconstructed backgrounds, and segmentation masks.

In Fig. 8, the results show the images generated by the reconstruction process. The generated mask shows the model’s ability to detect and segment the object, the background image shows the model’s ability to repaint the background, and the foreground image shows the model’s ability to detect and reconstruct conditioned on the object class.

The segmentation works under many different poses, sizes, and backgrounds. For dogs and cars, it can be seen that our model is sometimes better than the “ground-truth” masks, which were generated by a pre-trained network. The background repainting works well in the majority of cases, but we do notice that some backgrounds work better than others. The challenges are mostly noticeable in the cars dataset, where the smallest number of background patches was available in the background image set, leading to a less powerful background discriminator. Consequently, the performance of the background generation was affected.

e.3 Image to image translation

We supply more images, Fig. 9, to show the image to image translation capability of the model. We show that the disentanglement that emerged from the design allows manipulation of the reconstructed image by replacing the child code with a code from an arbitrary category. We can see that not only do the objects in the images change appearance, but the change is consistent across different images, while the background is mostly unaffected.

In birds, we can see that the generator is usually able to detect the different parts of the birds (wings, head, beak) and apply the correct color manipulation to the correct area. Furthermore, the color manipulation works on birds of different shapes and different original colors. The background is sometimes slightly altered, first, because it is regenerated every time, and second, because the foreground mask is soft and sometimes applies a slight manipulation to the background as well. In dogs, the manipulation is less effective, but it is noticeable and is correlated with the applied category. In cars, we can see the color manipulation for many different colors. We can also notice that the background is mostly unchanged and that the manipulation is applied correctly to the car chassis and not to other parts such as windows, tires and lights.
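The effect of a soft mask on the background can be illustrated with a small sketch. It assumes that the full image is a per-pixel blend of a foreground image and a background image, weighted by the mask. The function name `composite` and the toy pixel values are hypothetical and are not taken from the paper.

```python
def composite(mask, foreground, background):
    """Blend foreground and background per pixel with a soft mask in [0, 1]."""
    return [m * f + (1.0 - m) * b for m, f, b in zip(mask, foreground, background)]


# Toy single-channel example: pixel 0 is object, pixel 2 is background,
# and pixel 1 is a background pixel where the soft mask is slightly non-zero.
mask = [1.0, 0.1, 0.0]
foreground = [0.8, 0.8, 0.8]
background = [0.2, 0.2, 0.2]
blended = composite(mask, foreground, background)
```

In this toy case, the pixel with mask value 0.1 ends up slightly shifted toward the foreground color, which is the kind of slight background change described above when the child code, and hence the foreground color, is replaced.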

e.4 Background inpainting

Due to space limits, we could not address all tasks in the main paper. As an intermediate step of the reconstruction task, our model also performs a side-task of foreground object extraction and background inpainting, see Fig. 4. The model first detects the foreground in the image and produces a segmentation mask. Then, with the mask, the background is encoded and reconstructed. Because the mask is a prediction and not a ground truth, the model cannot simply fill in the masked pixels with background texture; it has to assume that the mask is not perfect and reconstruct the entire image. The drawback of this method is that the background is not always identical to the source in the background area, but the benefit is that the object is fully removed even when it is not fully covered by the mask.

We compare our model against images produced with Deep-Image-Prior (DIP; Ulyanov et al., CVPR 2018). There are two variants. In the first, DIP receives the ground-truth mask, and in the second, it is given the predicted mask. DIP optimizes its network for 1000 steps on the input image.

The results can be seen in Fig. 10. It can be observed that DIP works relatively well when using a perfectly covering mask, but fails when the mask does not fully cover the object. In contrast, our model suffers less when the mask is not perfect. We can also see that our model does not exactly inpaint the background but actually repaints it, which usually results in a slightly different background even where the image was not masked. As mentioned above, this may be beneficial when the mask is not perfect. Finally, our model performs the inpainting task in a single forward pass instead of the 1000 iterations of DIP.

e.5 User study images

In the main text, we present a user study to support the conditional generation results, see Tab. 4. We provide the images used in the study in Fig. 11. In the figure, the images are shown in order, whereas the participants were shown the images separately and in a random order.

Symbol Description Computed as (or comments)


child class
parent class
child class one-hot vector (style)
parent class one-hot vector (shape)
background one-hot vector
pose code
style code vector
shape code vector
background code vector
foreground pre-image
background pre-image
foreground mask
foreground image
background image
full image
foreground bypass
background bypass
image domain
background image domain


embedding LUT of child class
embedding LUT of parent class
embedding LUT of background
foreground generator, with subnetworks
background generator, with subnetworks
style encoder
content encoder
content encoder
image discriminator
background discriminator


number of child classes Depends on the dataset. Sec. 5
number of parent classes , Depends on the dataset.
dimensionality of pose code
dimensionality of style code
dimensionality of shape code
dimensionality of background code
size of image
Table 5: The components of the OneGAN model
Figure 7: Conditional Image Generation. From top to bottom: (i) real image, (ii-vi) generation of images with the encoded parent and child codes and a different vector per row.
Figure 8: Image Reconstruction. From top to bottom: (i) real image, (ii) reconstructed image, (iii) reconstructed foreground, (iv) reconstructed background, (v) ground-truth foreground mask, (vi) predicted foreground mask.
Figure 9: Image to Image Translation. From left to right: (i) real image, (ii-xiii) reconstructed images when the child code in each column is switched with a code from a selected category represented by the top image.
Figure 10: Background Inpainting. From top to bottom: (i) original image, (ii) image masked with real mask, (iii) image masked with predicted mask, (iv) OneGAN, (v) DIP with real mask, (vi) DIP with predicted mask.
Figure 11: User Study. From top to bottom: (i) OneGAN birds/dogs, (ii) FineGAN birds/dogs, (iii) StackGANv2 birds/dogs. Each row shows 10 generated images conditioned upon the real image on the left.
Module layers input output
GLUNorm ChannelSplit
Multiply -
UPBlk Upsample2d(/2, ) - -
() K3P1Conv2d() - -
GLUNorm - -
DOWNBlk K4S2P1Conv2d(,) - -
() LayerNorm - -
lReLU(0.2) - -
RESBlk0 K3P1Conv2d(,) -
() GLUNorm - -
K3P1Conv2d(,) - -
GLUNorm -
Add -
RESBlk K3P1Conv2d(,) - -
() GLUNorm - -
RESBlk0() - -
RESBlk0() - -
K3P1Conv2d(,2) - -
GLUNorm - -
Table 6: General modules
Module layers input output
Linear(, )
Linear(, 32768) -
Reshape(2048,4,4) - -
GLUNorm - -
UPBlk(1024,512,8) - -
UPBlk(512,256,16) -
UPBlk(256,128,32) -
UPBlk(128,64,64) - -
UPBlk(64,32,128) - -
K3P1Conv2d(32,3) + tanh -
Table 7: Background Generator
Module layers input output
Linear(, )
Linear(, )
Linear(, 32768) -
Reshape(2048,4,4) - -
GLUNorm - -
UPBlk(1024,512,8) - -
UPBlk(512,256,16) -
UPBlk(256,128,32) -
UPBlk(128,64,64) - -
UPBlk(64,3,128) -
K3P1Conv2d(16,3) + tanh
K3P1Conv2d(16,1) + sigmoid
Table 8: Foreground Generator
Module layers input output
K4S2P1Conv2d(3, 64) -
LayerNorm - -
lReLU(0.2) - -
DOWNBlk(64, 128) - -
DOWNBlk(128, 256) -
DOWNBlk(256, 512) -
DOWNBlk(512, 1024) - -
K3P1Conv2d(1024, 1024) - -
LayerNorm - -
lReLU(0.2) - -
Reshape(16384) - -
Linear(16384, 512) - -
LayerNorm - -
lReLU(0.2) -
Linear(512, )
Linear(512, )
Linear(512, )
Table 9: Style Encoder
Module layers input output
K4S2P1Conv2d(3, 64) -
LayerNorm - -
lReLU(0.2) - -
DOWNBlk(64, 128) - -
DOWNBlk(128, 256) -
K3P1Conv2d(256, 512) -
GLUNorm - -
UPBlk(256,256) -
DOWNBlk(256, 512) -
DOWNBlk(512, 1024) - -
K3P1Conv2d(1024, 1024) - -
LayerNorm - -
lReLU(0.2) - -
Reshape(16384) -
Linear(16384, 512) -
LayerNorm - -
lReLU(0.2) -
Linear(512, )
Linear(512, )
Linear(512, )
Linear(16384, 512) -
LayerNorm - -
lReLU(0.2) -
Linear(512, )
Linear(512, )
Table 10: Shape Encoder
Module layers input output
K4S2P1Conv2d(4, 64) -
DOWNBlk(64, 128) - -
DOWNBlk(128, 256) -
K3P1Conv2d(256, 512) -
GLUNorm - -
UPBlk(256,256) -
Table 11: Background Encoder
Module layers input output
DownSample2d(128, 126) -
K4S2P0Conv2d(3, 64) - -
lReLU(0.2) - -
K4S2P0Conv2d(64, 128) - -
lReLU(0.2) - -
K4S4P0Conv2d(128, 256) - -
lReLU(0.2) -
Table 12: Background Discriminator
Module layers input output
K4S2P1Conv2d(4, 64) -
LayerNorm - -
lReLU(0.2) - -
DOWNBlk(64, 128) - -
DOWNBlk(128, 256) - -
DOWNBlk(256, 512) -
DOWNBlk(512, 1024) - -
K3P1Conv2d(1024, 1024) - -
LayerNorm - -
lReLU(0.2) - -
Reshape(16384) -
Linear(16384, 512) -
LayerNorm - -
lReLU(0.2) - -
Linear(512, 1) -
Linear(16384, 512) -
LayerNorm - -
lReLU(0.2) - -
Linear(16384, 512) -
Table 13: Object Discriminator
Task:          Segmentation   Conditional generation
Metric:        IOU            C-IS
OneGAN         55.5           30.7
no mask-reg    35.3           19.5
no bypass      53.3           22.8
no mixup       54.1           17.5
full mixup     28.8           23.1
Table 14: Additional ablation results on the birds dataset. Comparison is done for segmentation (IOU) and conditional generation (C-IS).