Towards Lifelong Self-Supervision For Unpaired Image-to-Image Translation

03/31/2020 ∙ by Victor Schmidt, et al. ∙ Montréal Institute of Learning Algorithms

Unpaired Image-to-Image Translation (I2IT) tasks often suffer from a lack of data, a problem which self-supervised learning (SSL) has recently been very popular and successful at tackling. Leveraging auxiliary tasks such as rotation prediction or generative colorization, SSL can produce better and more robust representations in a low-data regime. Training such tasks alongside an I2IT task, however, becomes computationally intractable as model size and the number of tasks grow. On the other hand, learning sequentially could incur catastrophic forgetting of previously learned tasks. To alleviate this, we introduce Lifelong Self-Supervision (LiSS) as a way to pre-train an I2IT model (e.g., CycleGAN) on a set of self-supervised auxiliary tasks. By keeping an exponential moving average of past encoders and distilling the accumulated knowledge, we are able to maintain the network's validation performance on a number of tasks without any form of replay, parameter isolation or retraining techniques typically used in continual learning. We show that models trained with LiSS perform better on past tasks, while also being more robust than the CycleGAN baseline to color bias and entity entanglement (when two entities are very close).


1 Introduction

1.1 Motivation

In recent years, generative unsupervised image-to-image translation (I2IT) has gained tremendous popularity, enabling style transfer [cyclegan] and domain adaptation [cycada], raising awareness about wars [deepempathy] and Climate Change [vicc], and even helping model cloud reflectance fields [aicd]. I2IT has become a classical problem in computer vision which involves learning a conditional generative mapping from a source domain to a target domain: for example, generating an image of a zebra conditioned on an image of a horse. Obviously, there is no ground-truth data for this transformation, and we therefore cannot leverage pairs to learn this generative mapping. This is the challenge that unpaired I2IT addresses.

One of the main limitations of the I2IT task is that data is often scarce and hard to acquire [vicc, cyclegan, lee2018diverse]. To overcome this difficulty, self-supervised learning (SSL) appears to be a promising approach. In SSL, a model is trained on an auxiliary task (e.g., image rotation prediction) that leverages unsupervised data in order to obtain better representations which can help a downstream task learn faster when few (labeled) samples are available [jing2019self-supervised]. Given the variety of such potential auxiliary tasks, one could hope to jointly train many of them along with the main task, thereby improving the performance on the latter. However, this may be impractical in the context of I2IT since the models are typically quite large, making parallel training of self-supervised and translation tasks computationally intractable. On the other hand, any form of sequential learning may result in catastrophic forgetting [FRENCH1999128] and counter the benefits of SSL. In this paper, we therefore investigate how continual learning, a set of approaches designed to make sequential learning across multiple tasks robust to catastrophic forgetting, can be used to enable self-supervised pre-training of unpaired I2IT models.

We show that self-supervised auxiliary tasks improve CycleGAN's translations with more semantic representations, and that distillation [hinton, zhai2019lifelong] retains the knowledge acquired while pre-training the networks. For easier reference, we call this framework "Lifelong Self-Supervision" (LiSS) and show its effect on CycleGAN's performance in Section 3.

1.2 Related Work

Generative Adversarial Networks (GANs) [gangoodfellow] have had tremendous success in generating realistic and diverse images [karras2019analyzing, YiLLR19, brock2018large]. Generative I2IT approaches often leverage GANs to align the distributions of the source and target domains and produce realistic translations [huang2018multimodal]. In their seminal work, isola2016image proposed a principled framework for I2IT by introducing a weighted average of a conditional GAN loss and an L1 reconstruction loss to leverage pairs (for instance edges↔photos or labels↔facades). To address the setting where pairs are not available, cyclegan introduced the cycle-consistency loss, which uses two networks to symmetrically model source→target and target→source. Cycle-consistency induces a type of self-supervision by enforcing a reconstruction loss when an image goes through one network and then the other. Many attempts have since been made to improve the diversity [huang2018multimodal, lee2018diverse] and semantic consistency [mo2018instagan, mejjati2018unsupervised] of CycleGAN-inspired I2IT models by leveraging an encoder-decoder view of these models. We keep the CycleGAN encoder-decoder structure, and use self-supervision to encourage the encoder to produce meaningful features for the decoder to work with (see Section 2 for more details).

Self-supervised learning tries to leverage the information already present in samples to produce data-efficient representations. This is often measured by pre-training a standard supervised network on an auxiliary (or pretext) task, and then measuring its performance on the actual dataset with a fixed, low budget of labels [jing2019self-supervised]. Though not new [de1994learning], it has gained a lot of popularity in the deep learning era with important successes in language modeling [pennington2014glove, devlin2018bert, howard2018universal], speech recognition [ravanelli2020multitask], medical imaging [raghu2019transfusion] and computer vision in general [jing2019self-supervised]. Indeed, computer vision models seem to benefit significantly from self-supervised learning as the amount of unlabeled data can be very large while labeling can be expensive [jing2019self-supervised]. In this context, many visual pre-training tasks have been proposed, such as image rotation prediction [gidaris2018unsupervised], colorization [colorization], and solving jigsaw puzzles [jigsaw]. In addition to these context-based and generation-based pre-training methods, one can also leverage pseudo-labels from a pre-trained model in free semantic label-based methods [jing2019self-supervised]. In our work, we therefore add a depth prediction pretext task, as advocated by [doersch2017multitask], based on inferences from MegaDepth [li2018megadepth]. As the number of pretext tasks increases, so do the memory and computation needed to process samples. This is especially problematic for generation-based methods, which can be as computationally and memory intensive as the downstream task's model. We cannot therefore hope to train large models, such as those used in I2IT, in parallel with all these tasks.

One must therefore derive a learning procedure which ensures that the networks do not forget as they change tasks: this is the focus of continual (or lifelong) learning. Neural networks have been plagued by the inability to maintain performance on previously accomplished tasks when they are trained on new ones, a phenomenon that has been coined catastrophic forgetting [kirkpatrick2017overcoming]. Various continual learning methods have been developed to mitigate forgetting, which can be categorized as follows [lange2019continual]: replay-based methods, regularization-based methods and parameter isolation methods. In their work, matsumotocontinual use the parameter-isolation method PiggyBack [mallya2018piggyback] in order to learn a sequence of I2IT tasks without forgetting the previous ones. zhai2019lifelong, on the other hand, use distillation [hinton] in order to perform such tasks. In this work, we borrow ideas from the latter and apply them to a sequence of self-supervised tasks followed by a translation task.

2 Approach

2.1 Model

Our main contribution is a continual learning framework which maintains self-supervised performance and improves unpaired I2IT representations. We chose as our I2IT model the simple and well-understood CycleGAN [cyclegan].

Let $T = \{T_1, \dots, T_n\}$ be a set of tasks such that $\{T_1, \dots, T_{n-1}\}$ is a set of self-supervised tasks and $T_n$ is an I2IT task such as horse↔zebra [cyclegan]. The model is composed of two domain-specific sets of networks $(G_A, D_A)$ and $(G_B, D_B)$, where $G$ is a multi-headed generator and $D$ is a set of discriminators (1 per generative pretext task and 1 for the translation task). From now on, $X$ will be either $A$ or $B$, in which case $\overline{X}$ is $B$ or $A$. All the following is symmetric in $A$ and $B$.

Let us focus on $G_X$. It is composed of an encoder $E_X$ and a set of task-specific heads $\{H_X^i\}_{i \le n}$ which map from the encoder's output space to each task $T_i$'s output space. Let $x$ be a sample from domain $X$ and $z = E_X(x)$. In our work, we focus on the following 4 pretext tasks:

  1. $T_1$ is a rotation task, inspired by [gidaris2018unsupervised], and performs a classification task: $H_X^1(z) \in \mathbb{R}^4$, as there are 4 possible rotations (0°, 90°, 180° and 270°). When appropriate (see 2.2), we train $H_X^1 \circ E_X$ with a cross-entropy objective $\mathcal{L}_{rot}$.

  2. $T_2$ is a jigsaw puzzle as introduced by [jigsaw]. We split the image into 9 equal sub-regions which we randomly reorder according to 64 pre-selected permutations (out of the $9!$ possible ones): $H_X^2(z) \in \mathbb{R}^{64}$. Similarly, we train $H_X^2 \circ E_X$ with a cross-entropy objective $\mathcal{L}_{jig}$ (see the input and target construction sketch just after this list).

  3. $T_3$ is a relative depth prediction task inspired by [jiang2017selfsupervised]: $H_X^3(z) \in \mathbb{R}^{h \times w}$. $H_X^3 \circ E_X$ is trained with a regression objective $\mathcal{L}_{depth}$ with respect to pseudo-labels obtained from a pre-trained MegaDepth model [li2018megadepth].

  4. $T_4$ is a colorization task, as per [colorization, larsson2017colorproxy]: $H_X^4(z_{gray}) \in \mathbb{R}^{3 \times h \times w}$, where $z_{gray} = E_X(gray(x))$. Because a gray image can have several plausible colorizations, we train $H_X^4 \circ E_X$ with a mixture of a reconstruction loss with respect to $x$ and a GAN loss from a discriminator $D_X^{gray}$.
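For concreteness, here is a minimal PyTorch sketch of how the rotation and jigsaw inputs and targets can be constructed from a batch of images; the function names, shapes and the permutation bookkeeping are illustrative assumptions rather than the paper's exact implementation.

```python
import torch

def make_rotation_batch(x):
    """x: (B, C, H, W) batch of square images. Returns rotated images and labels in {0, 1, 2, 3}."""
    labels = torch.randint(0, 4, (x.size(0),), device=x.device)
    rotated = torch.stack([torch.rot90(img, int(k), dims=(1, 2)) for img, k in zip(x, labels)])
    return rotated, labels

def make_jigsaw_batch(x, permutations):
    """Split each image into a 3x3 grid of tiles, shuffle the tiles with one of the 64
    pre-selected permutations, and return the shuffled image plus the permutation index
    as the classification label. Assumes H and W are divisible by 3."""
    B, C, H, W = x.shape
    th, tw = H // 3, W // 3
    tiles = x.unfold(2, th, th).unfold(3, tw, tw).reshape(B, C, 9, th, tw)
    labels = torch.randint(0, len(permutations), (B,), device=x.device)
    shuffled = torch.stack([
        tiles[b, :, torch.as_tensor(permutations[l])] for b, l in enumerate(labels.tolist())
    ])
    out = (shuffled.reshape(B, C, 3, 3, th, tw)
                   .permute(0, 1, 2, 4, 3, 5)
                   .reshape(B, C, H, W))
    return out, labels
```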

The downstream translation task $T_n$ is based on CycleGAN's losses. For simplicity, in the following equations we call $G_X$ what is actually $H_X^n \circ E_X$, that is to say the standard CycleGAN generator.

$\mathcal{L}_{GAN}(G_X, D_{\overline{X}}) = \mathbb{E}_{\overline{x} \sim \overline{X}}\big[\log D_{\overline{X}}(\overline{x})\big] + \mathbb{E}_{x \sim X}\big[\log\big(1 - D_{\overline{X}}(G_X(x))\big)\big]$   (1)
$\mathcal{L}_{cyc}(G_X, G_{\overline{X}}) = \mathbb{E}_{x \sim X}\big[\| G_{\overline{X}}(G_X(x)) - x \|_1\big]$   (2)
$\mathcal{L}_{idt}(G_X) = \mathbb{E}_{\overline{x} \sim \overline{X}}\big[\| G_X(\overline{x}) - \overline{x} \|_1\big]$   (3)
$\mathcal{L}_{T_n} = \mathcal{L}_{GAN}(G_X, D_{\overline{X}}) + \lambda_{cyc}\, \mathcal{L}_{cyc}(G_X, G_{\overline{X}}) + \lambda_{idt}\, \mathcal{L}_{idt}(G_X)$   (4)
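As an illustration, below is a minimal PyTorch sketch of the generator-side objective for one direction, using the least-squares adversarial loss of the standard CycleGAN implementation; the function name and the weighting values are placeholders, not the exact settings used in our experiments.

```python
import torch
import torch.nn.functional as F

def generator_translation_loss(real_a, real_b, g_ab, g_ba, d_b,
                               lambda_cyc=10.0, lambda_idt=5.0):
    """Generator-side objective for the A->B direction (the B->A direction is symmetric)."""
    fake_b = g_ab(real_a)                             # A -> B translation
    pred = d_b(fake_b)
    adv = F.mse_loss(pred, torch.ones_like(pred))     # (1) adversarial (least-squares GAN) loss
    cyc = F.l1_loss(g_ba(fake_b), real_a)             # (2) cycle-consistency: A -> B -> A should recover A
    idt = F.l1_loss(g_ab(real_b), real_b)             # (3) identity: translating a B image to B changes nothing
    return adv + lambda_cyc * cyc + lambda_idt * idt  # (4) total translation loss
```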

The overall model for domain $X$ is therefore composed of a shared encoder network $E_X$ and a set of heads $\{H_X^i\}$ which map from this latent space to their specific task's output space. We now need to understand how these tasks can be combined in order to enable forward transfer from each of the self-supervised tasks to the translation task, without forgetting.
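The following PyTorch sketch illustrates one way such a multi-headed generator could be organized, with a shared encoder feeding task-specific heads; the class and argument names are illustrative assumptions rather than the paper's exact code.

```python
import torch.nn as nn

class MultiHeadGenerator(nn.Module):
    """Shared encoder E_X with one head per task (rotation, jigsaw, depth, colorization, translation)."""
    def __init__(self, encoder, heads: dict):
        super().__init__()
        self.encoder = encoder              # E_X: image -> latent feature map
        self.heads = nn.ModuleDict(heads)   # e.g. {"rotation": ..., "jigsaw": ..., "translation": ...}

    def forward(self, x, task: str):
        z = self.encoder(x)                 # all tasks share the same representation
        return self.heads[task](z)
```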

2.2 Training Schedule

When trying to incorporate self-supervised learning ideas into the I2IT framework, one could naively train all the heads in parallel (the $\lambda_i$ are scalars weighting the contribution of each loss):

$\mathcal{L}_{parallel} = \sum_{i=1}^{n} \lambda_i\, \mathcal{L}_{T_i}$   (5)

As explained previously, not only is this approach slower in that each sample has to go through all heads, but it also forces us to use smaller batch sizes due to memory constraints.

Another naive approach would be to perform each task sequentially. Given an ordering of $T$, one could train the model with:

$\mathcal{L}_{sequential}^{(i)} = \lambda_i\, \mathcal{L}_{T_i}$, where $T_i$ is the current task   (6)

In this sequential training schedule, the model transitions from $T_i$ to $T_{i+1}$ according to some curriculum. For readability and without loss of generality, we omit the weights $\lambda_i$ from now on. In our experiments we implement a threshold-based curriculum where the transition from one task to the next depends on its performance (in both domains $A$ and $B$) on some validation metric (see Section 3.1).
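As an illustration, a minimal sketch of such a threshold-based curriculum is given below; the thresholds match those reported in Section 3.1 (85% accuracy for classification tasks, a distance of 0.15 for regression tasks), while the metric bookkeeping and function names are assumptions.

```python
# Hypothetical threshold-based curriculum: move to the next task once the current
# task's validation metric passes its threshold in BOTH domains (A and B).
TASKS = ["rotation", "jigsaw", "depth", "colorization", "translation"]
THRESHOLDS = {"rotation": 0.85, "jigsaw": 0.85, "depth": 0.15, "colorization": 0.15}

def next_task(current: str, val_metrics: dict) -> str:
    """val_metrics[domain][task] holds a validation accuracy (classification) or distance (regression)."""
    if current == "translation":                       # final task: nothing to transition to
        return current
    thr = THRESHOLDS[current]
    if current in ("rotation", "jigsaw"):              # classification: accuracy must exceed the threshold
        done = all(val_metrics[d][current] >= thr for d in ("A", "B"))
    else:                                              # regression: distance must fall below the threshold
        done = all(val_metrics[d][current] <= thr for d in ("A", "B"))
    return TASKS[TASKS.index(current) + 1] if done else current
```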

In this paper we introduce Lifelong Self-Supervision (LiSS), a continual schedule which is similar to the sequential one, with the addition of a distillation loss. Inspired by tarvainen2017ean, we maintain an exponential moving average of past encoder parameters, therefore keeping a weighted memory of all past encoders at the cost of a single additional one. Formally, let $E_X^{(i)}$ be the frozen state of $E_X$ at the end of task $T_i$, i.e. when transitioning from $T_i$ to $T_{i+1}$. Then we define the (non-trainable) reference encoder $E_X^{ref}$ as follows:

$E_X^{ref,(i)} = \alpha\, E_X^{(i)} + (1 - \alpha)\, E_X^{ref,(i-1)}, \qquad E_X^{ref,(1)} = E_X^{(1)}$   (7)

with $\alpha \in [0, 1]$, the average being taken in parameter space. We use $E_X^{ref}$ in an additional distillation term in the loss, minimizing the distance between the current and reference encoded spaces:

$\mathcal{L}_{distill} = d\big(E_X(x),\, E_X^{ref,(i-1)}(x)\big)$   (8)
$\mathcal{L}_{LiSS}^{(i)} = \mathcal{L}_{T_i} + \mathcal{L}_{distill}$   (9)
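A sketch of how the reference encoder update and the distillation term could be implemented in PyTorch follows; parameter-space averaging and an L1 feature distance for $d$ are shown here as plausible choices, not as the paper's exact settings.

```python
import copy
import torch
import torch.nn.functional as F

@torch.no_grad()
def update_reference_encoder(ref_encoder, frozen_encoder, alpha, first_transition):
    """Eq. (7): EMA of frozen encoder states in parameter space, applied at each task transition.
    ref_encoder's parameters are assumed to be non-trainable."""
    if first_transition:
        ref_encoder.load_state_dict(frozen_encoder.state_dict())
        return
    for p_ref, p_new in zip(ref_encoder.parameters(), frozen_encoder.parameters()):
        p_ref.mul_(1 - alpha).add_(p_new, alpha=alpha)

def distillation_loss(encoder, ref_encoder, x):
    """Eq. (8): keep the current encoding close to the reference encoding (L1 distance assumed)."""
    with torch.no_grad():
        z_ref = ref_encoder(x)
    return F.l1_loss(encoder(x), z_ref)

# At each task transition, one would snapshot the current encoder, e.g.:
#   frozen = copy.deepcopy(encoder).eval()
#   update_reference_encoder(ref_encoder, frozen, alpha=..., first_transition=(i == 1))
```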

These ideas are general and not specific to I2IT or CycleGAN; this is why LiSS refers to the training schedule, not to a specific model.

Figure 1: Training of the first pretext task $T_1$. Orange blocks are (de-)convolutions, red blocks are sets of residual blocks. Here, $G_X$'s structure is for illustration purposes; see Appendix A for more details.
Figure 2: Training of the translation task $T_n$. As for all tasks after $T_1$, it includes a distillation loss between the current encoder's output and the reference encoder's output $E_X^{ref}(x)$.

3 Experiments

3.1 Setup

To evaluate the effect of LiSS on CycleGAN's performance, we compare it with a baseline CycleGAN from [cyclegan] and with the two aforementioned naive training schedules: sequential and parallel. We compare these 4 models on the horse↔zebra dataset and on a dataset of flooded↔non-flooded street-level scenes from [vicc] (the task is to simulate floods). As our goal is to understand how to efficiently leverage a given set of pretext tasks to improve representations, we keep $T$ constant across experiments.

All models are trained with the same hyper-parameters. We use the RAdam optimizer [radam] with the same learning rate across models, keep the optimizer's other hyper-parameters at their default values, and weight all task losses $\lambda_i$ equally. We leave the analysis of $\alpha$'s exact impact for future work and keep it fixed across experiments (see Eq. 7). Results are compared after 230k translation steps. The continual and sequential models therefore have more total steps of training, but in all cases the translation task is trained for 230k steps (i.e. 24 hours of training with the LiSS training schedule). We set a fixed curriculum as per Section 2.2, with thresholds at 85% accuracy for classification tasks and a distance of 0.15 for regression tasks. These were set to be 95% of the parallel schedule's final validation performance. Batch size is set to 5 for the LiSS and sequential schedules, but to 3 for the parallel schedule (the largest values which fit in an Nvidia V100's 16GB of GPU memory).

3.2 Image-to-Image Translation

Figures 3 and 4 show how the LiSS framework visually fares against the other schedules. Images are referred to as $(i, j)$, meaning row $i$ and column $j$ in those figures.

While our setup does not quite match the pixel-wise translation performance of CycleGAN, the model learns some interesting semantic features. Unlike CycleGAN, which tends to merge distinct instances of entities that are very close to each other (see Figure 4 for instance), our model is able to disentangle these instances and retain them as separate entities after translation. We can also see from Figures 3 and 4 that CycleGAN relies on color-based features, as evidenced by the zebra stripes projected on the brown patch of ground and the sky artifacts. On the other hand, adding self-supervised tasks makes the models less sensitive to such glaring errors (see the rows below the aforementioned CycleGAN translations in Figures 3 and 4).

Compared to the parallel schedule, LiSS keeps relevant features without enforcing continuous training of the pretext heads, which gives useful freedom to the model. It achieves similar translation performance and better color consistency, though one could argue that the parallel schedule's translations are visually slightly better.

The sequential schedule, on the other hand, seems to have slightly worse translation performance. We can see that some of the useful "knowledge" the two other models retain is no longer available to the translation head: in Figure 3, the smaller zebra is merged with the taller one and the brown patch of ground shows slight stripes.

Figure 3: Comparison of models on the horse↔zebra dataset, with rows corresponding to the image to translate (row 0) and then to translations from: CycleGAN (row 1), LiSS CycleGAN (row 2), parallel schedule (row 3), sequential schedule (row 4). Note: the slight discrepancy in cropping is due to data-loading randomness.
Figure 4: Comparison of models on the flooded↔non-flooded dataset, with rows corresponding to the image to translate (row 0) and then to translations from: CycleGAN (row 1), LiSS training (row 2), parallel training (row 3), sequential training (row 4).

3.3 Continual Learning Performance

Our main finding is that Lifelong Self-Supervision partially prevents forgetting. We can see in Figures 5 and 6 that our formulation protects the model from forgetting as severe as in sequential training, while providing enough flexibility for it to learn new tasks.

In both datasets, we observe that the naive training schedules behave as expected: the sequential one is able to learn new tasks the fastest, as the model is less constrained. However, the sequential setup forgets previous tasks almost instantly as it changes its focus to a new task. On the other hand, the more constrained parallel schedule shows that continuously training on all tasks allows the model to master them all at once. This however comes at a memory and time cost, as we could not fit more than 3 samples per batch (vs. 5 for the other schedules), and the average processing time per sample is much larger (0.27s against an average of 0.12s for the other schedules). This means that to complete 230k translation steps, the parallel schedule typically takes considerably longer than the roughly 24 hours LiSS needs (counting all the pretext tasks).

Figures 5 and 6 show how LiSS maintains accuracies for the rotation and jigsaw tasks while performing slightly worse on the depth prediction and colorization tasks. As the encoder produces increasingly richer representations, the distillation loss prevents it from mapping input images to regions that would harm previous tasks. Because of our problem's sequential nature, decoding heads do not change after they have achieved the curriculum's required performance, and the burden of producing stable yet informative features falls entirely on the encoders, as the heads cannot adjust to their changes.

Tables 1 and 2 show that it takes more steps for the pretext tasks to be learnt with LiSS. Intuitively, when training sequentially, the encoders are free to adjust exactly to the current task. When training with LiSS, they are more constrained and it takes more iterations for them to reach the same performance on pretext tasks. This constraint is however pliable enough for the encoders to adjust to new tasks.

Schedule     Task           Start step   End step
LiSS         Rotation       0            8 000
LiSS         Jigsaw         8 000        158 000
LiSS         Depth          158 000      170 000
LiSS         Colorization   170 000      172 000
Sequential   Rotation       0            24 000
Sequential   Jigsaw         24 000       96 000
Sequential   Depth          96 000       102 000
Sequential   Colorization   102 000      108 000
Table 1: Transition steps for the horse↔zebra task. Translation starts when the colorization task is mastered.
Figures 5 and 6 each comprise six panels: (a) Accuracies - LiSS, (b) Accuracies - Parallel, (c) Accuracies - Sequential, (d) Losses - LiSS, (e) Losses - Parallel, (f) Losses - Sequential.
Figure 5: Validation performance of the various schedules on the horse↔zebra dataset. Accuracies are reported in the top row for the rotation and jigsaw heads of both $G_A$ and $G_B$. Similarly, colorization (named gray in the plots) and depth prediction regression performances are plotted in the bottom row. Note how, unlike sequential training, LiSS training maintains validation accuracies even though the model does not see the tasks anymore. Losses bump a little but converge to a better value than the sequential schedule's. This illustrates how the LiSS training framework enables the model to leverage independent tasks' benefits while maintaining sufficient flexibility to learn new tasks, at a very low cost.
Figure 6: Same plots as in Figure 5 for the flooded↔non-flooded dataset. Once again we can see the drastic difference between LiSS and the naïve sequential training schedule. The difference is much milder when comparing LiSS with parallel training. The distillation loss prevents forgetting and maintains performance while allowing the network to learn new tasks. Transition steps are referenced in the Appendix's Table 2.

4 Discussion

We propose a method, Lifelong Self-Supervision (LiSS), enabling CycleGAN to leverage sequential self-supervised auxiliary tasks to improve its representations. By distilling the knowledge of a reference encoder (an exponential moving average of previous encoders, in parameter space), we prevent catastrophic forgetting of the auxiliary tasks, thereby allowing CycleGAN to better disentangle instances of objects to be translated and to rely less on colors. This framework can bring the benefits of training on all tasks at once at a much lower memory and computational cost, as it only requires us to keep one additional encoder. Our exploratory experiments show encouraging results which will need further investigation in future work to produce a principled framework.

Open questions include the exact impact of the reference encoder's algebra (namely the exponential moving average versus other moving averages, and the impact of $\alpha$), and a more thorough hyper-parameter search in order to tune the loss weights and achieve better pixel-level results. Additionally, exploring other schedules and auxiliary tasks would allow for a better understanding of how SSL can improve unpaired I2IT models. Finally, while CycleGAN's simplicity allowed us to isolate LiSS's contribution to improved translations, exploring its capabilities on more complex architectures is a promising direction for future work.

References

Appendix A Implementation details

Model        Task           Start step   End step
LiSS         Rotation       0            24 000
LiSS         Jigsaw         24 000       158 000
LiSS         Depth          158 000      174 000
LiSS         Colorization   174 000      176 000
Sequential   Rotation       0            28 000
Sequential   Jigsaw         28 000       114 000
Sequential   Depth          114 000      122 000
Sequential   Colorization   122 000      124 000
Table 2: Transition steps for the flooded↔non-flooded task. Translation starts when the colorization task is mastered.

Our framework's network architecture follows the baseline CycleGAN [cyclegan], with some differences in the generator to support self-supervision. We use “ResnetBlock” to denote residual blocks [residual_blocks]. “C HxW-S-P Conv” represents a convolutional layer with C channels, a kernel of size HxW, padding P and stride S. “NConv” denotes a convolutional layer followed by an instance norm. “TConv” denotes a transpose convolution layer, proposed by conv_transpose, followed by an instance norm.

Discriminator Network Architecture.

We use PatchGANs [isola2017image_patchgan, li2016precomputed_patchgan, ledig2017photo_patchgan], like the one used in the original CycleGAN [cyclegan] baseline model, as shown in Table 3. The discriminator outputs a real-or-fake label for overlapping patches. The GAN loss then compares the target label (real or fake) to the average of the patch predictions for the input image.

Layer Output Activation
Input None
Conv LeakyReLU
NConv LeakyReLU
NConv LeakyReLU
NConv LeakyReLU
Conv None
Table 3: Discriminator’s PatchGAN Architecture
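As an illustration, here is a PatchGAN-style discriminator in PyTorch in the spirit of Table 3; the channel widths and the 4x4 stride-2 kernels are the common CycleGAN defaults and are assumptions here, since the exact output shapes are not reproduced in the table.

```python
import torch.nn as nn

def patchgan_discriminator(in_channels=3, base=64):
    """PatchGAN: a stack of strided convolutions ending in a 1-channel map of patch scores."""
    def nconv(cin, cout, stride):  # conv + instance norm, as "NConv" in this appendix
        return nn.Sequential(
            nn.Conv2d(cin, cout, kernel_size=4, stride=stride, padding=1),
            nn.InstanceNorm2d(cout),
            nn.LeakyReLU(0.2, inplace=True),
        )
    return nn.Sequential(
        nn.Conv2d(in_channels, base, kernel_size=4, stride=2, padding=1),
        nn.LeakyReLU(0.2, inplace=True),
        nconv(base, base * 2, stride=2),
        nconv(base * 2, base * 4, stride=2),
        nconv(base * 4, base * 8, stride=1),
        nn.Conv2d(base * 8, 1, kernel_size=4, stride=1, padding=1),  # per-patch real/fake scores
    )
```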

Encoder Network Architecture.

The encoder network's architecture is inspired by [johnson2016perceptual], as shown in Table 4. The network starts with a reflection padding of size 3 and a 7x7 convolution, which avoids severe artifacts around the borders of the generated images, followed by 3x3 convolutional blocks with padding 1 and stride 2 to downsample the input image, and finally by 3 residual blocks.

Layer Output Activation
Input None
ReflectionPad p=3 None
NConv ReLU
NConv ReLU
NConv ReLU
ResnetBlock None
ResnetBlock None
ResnetBlock None
Table 4: Encoder’s Network Architecture
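A sketch of such an encoder in PyTorch is given below, mirroring Table 4's layer order (reflection padding, a 7x7 convolution, two stride-2 downsampling convolutions, then residual blocks); the channel widths are assumptions, since the output column of the table is not reproduced here.

```python
import torch.nn as nn

class ResnetBlock(nn.Module):
    """Two 3x3 convolutions with instance norm and a skip connection."""
    def __init__(self, dim):
        super().__init__()
        self.block = nn.Sequential(
            nn.ReflectionPad2d(1), nn.Conv2d(dim, dim, 3), nn.InstanceNorm2d(dim), nn.ReLU(True),
            nn.ReflectionPad2d(1), nn.Conv2d(dim, dim, 3), nn.InstanceNorm2d(dim),
        )

    def forward(self, x):
        return x + self.block(x)

def encoder(in_channels=3, base=64):
    return nn.Sequential(
        nn.ReflectionPad2d(3),
        nn.Conv2d(in_channels, base, kernel_size=7), nn.InstanceNorm2d(base), nn.ReLU(True),
        nn.Conv2d(base, base * 2, kernel_size=3, stride=2, padding=1), nn.InstanceNorm2d(base * 2), nn.ReLU(True),
        nn.Conv2d(base * 2, base * 4, kernel_size=3, stride=2, padding=1), nn.InstanceNorm2d(base * 4), nn.ReLU(True),
        ResnetBlock(base * 4), ResnetBlock(base * 4), ResnetBlock(base * 4),
    )
```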

Translation and Colorization Head Architectures

The translation head's architecture follows the standard CycleGAN generator [cyclegan], as shown in Table 5. It consists of 3 residual blocks followed by upsampling convolutions. For the colorization head to share the encoder with the other tasks, we repeat grayscale images along the channel dimension.

Layer Output Activation
Input None
ResnetBlock None
ResnetBlock None
ResnetBlock None
TConv ReLU
TConv ReLU
ReflectionPad p=3 None
Conv Tanh
Table 5: Decoder and Colorization Network Architecture
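Below is a plausible PyTorch rendering of the decoder in Table 5 (residual blocks, two transpose convolutions for upsampling, then a reflection-padded 7x7 convolution with Tanh), reusing the ResnetBlock defined in the encoder sketch; as with the encoder, the channel widths are assumed rather than taken from the table.

```python
import torch.nn as nn

def decoder(out_channels=3, base=64):
    """Translation / colorization head: ResnetBlocks, upsampling TConvs, 7x7 output conv with Tanh.
    ResnetBlock is the module defined in the encoder sketch above."""
    dim = base * 4
    return nn.Sequential(
        ResnetBlock(dim), ResnetBlock(dim), ResnetBlock(dim),
        nn.ConvTranspose2d(dim, dim // 2, kernel_size=3, stride=2, padding=1, output_padding=1),
        nn.InstanceNorm2d(dim // 2), nn.ReLU(True),
        nn.ConvTranspose2d(dim // 2, dim // 4, kernel_size=3, stride=2, padding=1, output_padding=1),
        nn.InstanceNorm2d(dim // 4), nn.ReLU(True),
        nn.ReflectionPad2d(3),
        nn.Conv2d(dim // 4, out_channels, kernel_size=7),
        nn.Tanh(),
    )
```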

Rotation Network Architecture.

The rotation head's architecture is inspired by [gidaris2018unsupervised] and shown in Table 6. The network performs a simple classification task over the 4 possible rotations (0°, 90°, 180° and 270°).

Layer Output Activation
Input None
NConv LeakyReLU
NConv LeakyReLU
MaxPool None
NConv LeakyReLU
Flatten None
Linear None
Table 6: Rotation Network Architecture
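The rotation and jigsaw heads share the same overall structure (strided NConv blocks, a max-pooling layer, a flattening step and a final linear layer), so a single parameterized sketch is shown here for both; channel widths, kernel sizes and the flattened feature size are assumptions, since Tables 6 and 7 only preserve the layer order.

```python
import torch.nn as nn

def classification_head(in_dim=256, num_classes=4, base=128):
    """Rotation head: num_classes=4; Jigsaw head: num_classes=64. Operates on the encoder's feature map."""
    return nn.Sequential(
        nn.Conv2d(in_dim, base, kernel_size=3, stride=2, padding=1), nn.InstanceNorm2d(base), nn.LeakyReLU(0.2, True),
        nn.Conv2d(base, base, kernel_size=3, stride=2, padding=1), nn.InstanceNorm2d(base), nn.LeakyReLU(0.2, True),
        nn.MaxPool2d(2),
        nn.Conv2d(base, base, kernel_size=3, stride=2, padding=1), nn.InstanceNorm2d(base), nn.LeakyReLU(0.2, True),
        nn.Flatten(),
        nn.LazyLinear(num_classes),   # lazy linear, since the flattened size depends on the input resolution
    )
```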

Jigsaw Network Architecture.

The jigsaw head predicts the correct ordering of the shuffled patches of an input image. The network consists of a set of convolutions extracting features from the input image, followed by a fully connected layer mapping them to the possible permutations. The architecture shown in Table 7 performs a classification task over the 64 possible permutations.

Layer Output Activation
Input None
NConv LeakyReLU
NConv LeakyReLU
MaxPool None
NConv LeakyReLU
Flatten None
Linear None
Table 7: Jigsaw Network Architecture

Depth Prediction Network Architecture.

The depth head's architecture is inspired by [jiang2017selfsupervised] and shown in Table 8. The network is trained on pseudo-labels predicted by a pre-trained MegaDepth model [li2018megadepth].

Layer Output Activation
Input None
ResnetBlock None
ResnetBlock None
ResnetBlock None
TConv ReLU
TConv ReLU
ReflectionPad p=3 None
Conv None
Table 8: Depth Prediction’s Network Architecture