In recent years generative unsupervised image-to-image translation (I2IT) has gained tremendous popularity, enabling style transfer [cyclegan] and domain adaptation [cycada], raising awareness about wars [deepempathy] and climate change [vicc], and even helping model cloud reflectance fields [aicd]. I2IT has become a classical problem in computer vision: it involves learning a conditional generative mapping from a source domain to a target domain, for example generating an image of a zebra conditioned on an image of a horse. Obviously, there is no ground-truth data for this transformation, so we cannot leverage pairs to learn this generative mapping. This is the challenge that unpaired I2IT addresses.
One of the main limitations of the I2IT task is that data is often scarce and hard to acquire [vicc, cyclegan, lee2018diverse]. To overcome this difficulty, self-supervised learning (SSL) appears to be a promising approach. In SSL, a model is trained on an auxiliary task (e.g., image rotation prediction) that leverages unsupervised data in order to obtain better representations, which can help a downstream task learn faster when few (labeled) samples are available [jing2019self-supervised]. Given the variety of such potential auxiliary tasks, one could hope to jointly train many of them along with the main task, thereby improving the performance on the latter. However, this may be impractical in the context of I2IT since the models are typically quite large, making parallel training of self-supervised and translation tasks computationally intractable. On the other hand, any form of sequential learning may result in catastrophic forgetting [FRENCH1999128] and counter the benefits of SSL. In this paper, we therefore investigate how continual learning, a set of approaches designed to make sequential learning across multiple tasks robust to catastrophic forgetting, can be used to enable self-supervised pre-training of unpaired I2IT models.
We show that self-supervised auxiliary tasks improve CycleGAN's translations by producing more semantic representations, and that distillation [hinton, zhai2019lifelong] retains the knowledge acquired while pre-training the networks. For easier reference, we call this framework "Lifelong Self-Supervision", or LiSS, and show its results on CycleGAN's performance in Section 3.
1.2 Related Work
Generative Adversarial Networks (GANs) [gangoodfellow] have had tremendous success in generating realistic and diverse images [karras2019analyzing, YiLLR19, brock2018large]. Generative I2IT approaches often leverage GANs to align the distributions of the source and target domains and produce realistic translations [huang2018multimodal]. In their seminal work, isola2016image proposed a principled framework for I2IT by introducing a weighted average of a conditional GAN loss along with an L1 reconstruction loss to leverage pairs (for instance edges→photos or labels→facades). To address the setting where pairs are not available, cyclegan introduced the cycle-consistency loss, which uses two networks to symmetrically model source→target and target→source. Cycle-consistency induces a type of self-supervision by enforcing a reconstruction loss when an image goes through one network and then the other. Many attempts have since been made to improve the diversity [huang2018multimodal, lee2018diverse] and semantic consistency [mo2018instagan, mejjati2018unsupervised] of CycleGAN-inspired I2IT models by leveraging an encoder-decoder view of the models. We stick with the CycleGAN encoder-decoder model, and use self-supervision to encourage the encoder to encode meaningful features for the decoder to work with (see section 2 for more details).
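The cycle-consistency idea above can be sketched in a few lines. The snippet below uses toy stand-in mappings rather than real translation networks; all names are illustrative and not taken from the paper's code:

```python
import numpy as np

def cycle_consistency_loss(x, g_ab, g_ba):
    """L1 cycle-consistency: x -> g_ab(x) -> g_ba(g_ab(x)) should return to x."""
    reconstruction = g_ba(g_ab(x))
    return np.abs(x - reconstruction).mean()

# Toy invertible "generators" standing in for the two translation networks.
g_ab = lambda x: x + 1.0
g_ba = lambda x: x - 1.0

x = np.random.rand(3, 256, 256)  # a fake image, channels-first
loss = cycle_consistency_loss(x, g_ab, g_ba)
print(loss)  # close to 0.0 for this (numerically) invertible pair
```

In the real model the two mappings are full generators, and the loss is enforced in both directions (A→B→A and B→A→B).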
Self-supervised learning tries to leverage the information already present in samples to produce data-efficient representations. This is often measured by pre-training a standard supervised network on an auxiliary (or pretext) task, and then measuring its performance on the actual dataset with a fixed, low budget of labels [jing2019self-supervised]. Though not new [de1994learning], it has gained a lot of popularity in the deep learning era with important successes in language modeling [pennington2014glove, devlin2018bert, howard2018universal], speech recognition [ravanelli2020multitask], medical imaging [raghu2019transfusion] and computer vision in general [jing2019self-supervised]. Indeed, computer vision models seem to benefit significantly from self-supervised learning as the amount of unlabeled data can be very large, while labeling can be expensive [jing2019self-supervised]. In this context, many visual pre-training tasks have been proposed, such as image rotation prediction [gidaris2018unsupervised], colorization [colorization], and solving jigsaw puzzles [jigsaw]. In addition to these context-based and generation-based pre-training methods, one can also leverage pseudo-labels from a pre-trained model in free semantic label-based methods [jing2019self-supervised]. In our work we therefore add a depth prediction pretext task, as advocated by [doersch2017multitask], based on inferences from MegaDepth [li2018megadepth]. As the number of pretext tasks increases, so do the memory and computational time needed to process samples. This is especially problematic for generation-based methods, which can be as computationally and memory intensive as the downstream task's model. We cannot therefore hope to train models as large as those used in I2IT in parallel with all these tasks.
One must therefore derive a learning procedure which ensures that the networks do not forget as they change tasks: this is the focus of continual (or lifelong) learning. Neural networks have been plagued by the inability to maintain performance on previously accomplished tasks when they are trained on new ones, a phenomenon that has been coined catastrophic forgetting [kirkpatrick2017overcoming]. Various continual learning methods have been developed to mitigate forgetting, which can be categorized as follows [lange2019continual]: replay-based methods, regularization-based methods and parameter-isolation methods. In their work, matsumotocontinual use the parameter-isolation method PiggyBack [mallya2018piggyback] in order to learn a sequence of I2IT tasks without forgetting the previous ones. zhai2019lifelong on the other hand use distillation [hinton] in order to perform such tasks. In this work, we borrow ideas from the latter and apply them to a sequence of self-supervised tasks followed by a translation task.
Our main contribution is a continual learning framework which maintains self-supervised performance and improves unpaired I2IT representations. We chose as our I2IT model the simple and well-understood CycleGAN [cyclegan].
Let T = {T_1, …, T_n} be a set of tasks such that {T_1, …, T_(n−1)} is a set of self-supervised tasks and T_n is an I2IT task such as horse→zebra [cyclegan]. The model is composed of two domain-specific sets of networks (G_A, D_A) and (G_B, D_B), where G is a multi-headed generator and D is a set of discriminators (1 per generative pretext task and 1 for the translation task). From now on, X will be either A or B, in which case Y is B or A respectively. All the following is symmetric in A and B.
Let us focus on G_A. It is composed of an encoder E_A and a set of task-specific heads {H_A^i} which map from the encoder's output space to each task T_i's output space. Let x_A be a sample from domain A and z_A = E_A(x_A). In our work, we focus on the following 4 pretext tasks:
T_1 is a rotation task, inspired from [gidaris2018unsupervised], and performs a classification task with 4 classes, as there are 4 possible rotations (0°, 90°, 180° and 270°). When appropriate (see 2.2) we train it with a cross-entropy objective.
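As a sketch, the rotation pretext data can be generated with a short NumPy routine; this is illustrative, not the paper's implementation:

```python
import numpy as np

def make_rotation_batch(img):
    """Generate the 4 rotated versions of a (C, H, W) image with class labels."""
    # np.rot90 rotates in the H-W plane (axes 1 and 2 for channels-first images)
    rotations = [np.rot90(img, k=k, axes=(1, 2)) for k in range(4)]
    labels = np.arange(4)  # 0 -> 0 deg, 1 -> 90 deg, 2 -> 180 deg, 3 -> 270 deg
    return np.stack(rotations), labels

img = np.random.rand(3, 32, 32)
batch, labels = make_rotation_batch(img)
print(batch.shape, labels)  # (4, 3, 32, 32) [0 1 2 3]
```

The rotation head is then trained to recover the label from the encoder's representation of each rotated image.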
T_2 is a jigsaw puzzle as introduced by [jigsaw]. We split the image into 9 equal sub-regions which we randomly reorder according to 64 pre-selected permutations (out of the 9! possible ones); the head classifies which permutation was applied. Similarly, we train it with a cross-entropy objective.
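A minimal sketch of the jigsaw sample construction follows. Here the 64 permutations are drawn randomly for illustration, whereas the paper pre-selects them (typically to maximize pairwise distance); all names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
# 64 fixed permutations of the 9 patch indices, sampled once.
PERMUTATIONS = np.stack([rng.permutation(9) for _ in range(64)])

def make_jigsaw_sample(img, perm_idx):
    """Split a (C, H, W) image into a 3x3 grid and reorder the 9 patches."""
    c, h, w = img.shape
    ph, pw = h // 3, w // 3
    patches = [img[:, i*ph:(i+1)*ph, j*pw:(j+1)*pw]
               for i in range(3) for j in range(3)]
    shuffled = [patches[k] for k in PERMUTATIONS[perm_idx]]
    return np.stack(shuffled), perm_idx  # shuffled patches, classification label

img = np.random.rand(3, 96, 96)
patches, label = make_jigsaw_sample(img, perm_idx=5)
print(patches.shape, label)  # (9, 3, 32, 32) 5
```

The head then performs 64-way classification of the permutation index from the encoded patches.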
T_3 is a relative depth prediction task inspired from [jiang2017selfsupervised]: the head regresses a depth map from z_A. It is trained with a regression objective with respect to pseudo-labels obtained from a pre-trained MegaDepth model [li2018megadepth].
T_4 is a colorization task, as per [colorization, larsson2017colorproxy]: the head maps the encoding of a grayscale image back to a color image. Because a gray image can have several possible colorizations, we train it with a mixture of a reconstruction loss with respect to the original image and a GAN loss from a dedicated discriminator.
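The mixture can be sketched as below. The exact reconstruction term is not specified in this copy, so L1 is shown as one common choice, and the discriminator is a toy stand-in; the weights and names are illustrative:

```python
import numpy as np

def colorization_loss(fake_color, real_color, disc, lambda_rec=10.0, lambda_gan=1.0):
    """Mixture of a reconstruction term (L1 here) and a GAN term.

    `disc` maps an image to a realism score in (0, 1); any callable works
    for this sketch.
    """
    rec = np.abs(fake_color - real_color).mean()
    gan = -np.log(disc(fake_color) + 1e-8)  # push the generator toward "real"
    return lambda_rec * rec + lambda_gan * gan

disc = lambda img: 1.0 / (1.0 + np.exp(-img.mean()))  # toy discriminator
real = np.random.rand(3, 64, 64)
fake = real + 0.1 * np.random.randn(3, 64, 64)
print(colorization_loss(fake, real, disc))
```

The GAN term is what lets the head commit to one plausible colorization instead of averaging over all of them.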
The downstream translation task T_n is based on CycleGAN's losses. For simplicity, in the following equations we write G_A for what is actually the composition of the translation head with the encoder E_A, that is to say the standard CycleGAN generator.
The overall model for domain A is therefore composed of a shared encoder network E_A and a set of heads {H_A^i} which map from this latent space to their specific task's output space. We now need to understand how these tasks can be combined together in order to enable forward transfer from each of the self-supervised tasks to the translation task, without forgetting.
2.2 Training Schedule
When trying to incorporate self-supervised learning ideas into the I2IT framework, one could naively train all the heads in parallel, minimizing a weighted sum of all the task losses (the λ_i are scalars weighting the contribution of each loss):

L_parallel = Σ_i λ_i · L_(T_i)
As explained previously, not only is this approach slower in that each sample has to go through all heads, but it also forces us to use smaller batch sizes due to memory constraints.
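A one-step sketch of the parallel schedule makes the cost explicit: every sample passes through every head. All names here are illustrative:

```python
def parallel_step(x, heads, losses, weights):
    """One parallel-schedule update: every head sees every sample.

    heads/losses/weights are aligned lists; the weights play the role of
    the lambda_i scalars in the text.
    """
    total = 0.0
    for head, loss_fn, w in zip(heads, losses, weights):
        total += w * loss_fn(head(x))  # one forward pass per head, per sample
    return total

# Toy example with two "tasks" on a scalar input.
heads = [lambda x: x * 2, lambda x: x + 1]
losses = [lambda y: y ** 2, lambda y: abs(y)]
print(parallel_step(1.0, heads, losses, weights=[0.5, 1.0]))  # 0.5*4 + 1.0*2 = 4.0
```

With real generator-sized heads, those extra forward passes are what force the smaller batch size.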
Another naive approach would be to perform each task sequentially: given an ordering of T, one could train the model on one task at a time.
In this sequential training schedule, the model transitions from task T_i to task T_(i+1) according to some curriculum. For readability and without loss of generality, we drop the loss weights from now on. In our experiments we implement a threshold-based curriculum where the transition from one task to the next depends on the current task's performance (in both domains A and B) on some validation metric (see Section 3.1).
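The threshold-based curriculum can be sketched as a small state machine. The threshold values and the higher-is-better convention are illustrative (a regression metric such as a distance would flip the comparison):

```python
class ThresholdCurriculum:
    """Advance to the next task once validation performance in BOTH domains
    clears the current task's threshold (assumes higher is better)."""

    def __init__(self, thresholds):
        self.thresholds = thresholds  # one threshold per pretext task
        self.current = 0              # index of the task being trained

    def update(self, val_metric_a, val_metric_b):
        done = self.current >= len(self.thresholds)
        if not done and min(val_metric_a, val_metric_b) >= self.thresholds[self.current]:
            self.current += 1  # both domains passed: move to the next task
        return self.current

curriculum = ThresholdCurriculum(thresholds=[0.85, 0.85])  # e.g. rotation, jigsaw
curriculum.update(0.90, 0.80)         # domain B below threshold -> stay on task 0
print(curriculum.update(0.90, 0.87))  # both above 0.85 -> move to task 1
```

Requiring both domains to pass keeps the two symmetric models synchronized on the same task.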
In this paper we introduce Lifelong Self-Supervision (LiSS), a continual schedule which is similar to the sequential one with the addition of a distillation loss. Inspired by tarvainen2017ean, we maintain an exponential moving average of past encoder parameters, thereby keeping a weighted memory of all past encoders at the cost of a single additional one. Formally, let E^(i) be the frozen state of the encoder E at the end of the i-th task, i.e. when transitioning from T_i to T_(i+1). Then we define the (non-trainable) reference encoder E_ref as follows:

E_ref^(i) = α · E_ref^(i−1) + (1 − α) · E^(i)

with α ∈ [0, 1]. We use E_ref in an additional distillation term in the loss, minimizing the distance between the current and reference encoded spaces:

L_distill(x) = ‖E(x) − E_ref(x)‖
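The two ingredients above can be sketched in a few lines. The EMA runs over parameters of frozen encoder snapshots; the distance used for distillation is elided in this copy, so L1 is shown as one choice, and all names are illustrative:

```python
import numpy as np

def update_reference(ref_params, new_params, alpha=0.5):
    """EMA over *parameters* of past frozen encoders (one extra encoder's cost)."""
    return {k: alpha * ref_params[k] + (1 - alpha) * new_params[k]
            for k in ref_params}

def distillation_loss(current_features, reference_features):
    """Penalize drift between the current and reference encoded spaces (L1 here)."""
    return np.abs(current_features - reference_features).mean()

# Toy single-parameter "encoders": after two task transitions the reference
# remembers a weighted mix of both frozen states.
ref = {"w": 1.0}                          # frozen encoder after task 1
ref = update_reference(ref, {"w": 3.0})   # frozen encoder after task 2
print(ref)  # {'w': 2.0} with alpha = 0.5
```

Because only the running average is stored, memory stays constant no matter how many tasks have been completed.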
These ideas are general and not specific to I2IT or CycleGAN; this is why LiSS refers to the training procedure, not to a specific model.
To evaluate the effect of LiSS on CycleGAN's performance, we compare it with a baseline CycleGAN from [cyclegan] and with the two aforementioned naive training schedules: sequential and parallel. We compare these 4 models on the horse↔zebra dataset and on a dataset of flooded↔non-flooded street-level scenes from [vicc] (the task is to simulate floods). As our goal is to understand how to efficiently leverage a set of given pretext tasks to improve representations, we keep the set of pretext tasks constant across experiments.
All models are trained with the same hyper-parameters, using the RAdam optimizer [radam] with its momentum parameters kept to their default values, and with all task losses weighted equally. We leave the analysis of the EMA coefficient's exact impact for future work and keep it fixed across experiments (see Eq. 7). Results are compared after 230k translation steps. The continual and sequential models therefore have more total steps of training, but in all cases the translation task is trained for 230k steps (i.e. 24 hours of training with the LiSS training schedule). We set a fixed curriculum as per Section 2.2 with thresholds at 85% accuracy for the classification tasks and a distance of 0.15 for the regression tasks; these were set to be 95% of the parallel schedule's final validation performance. Batch size is set to 5 for the LiSS and sequential schedules, but to 3 for the parallel schedule (the largest values which fit in an Nvidia V100's 16GB of GPU memory).
3.2 Image-to-Image Translation
While our setup does not quite match the pixelwise translation performance of CycleGAN, the model learns some interesting semantic features. Unlike CycleGAN, which tends to merge distinct instances of entities that are very close to each other (see Figure 4), our model is able to disentangle these instances and retain them as separate entities after translation. We can also see in Figure 4 that CycleGAN relies on color-based features, as evidenced by the zebra stripes projected on the brown patch of ground and the sky artifacts. On the other hand, adding self-supervised tasks makes the models less sensitive to such glaring errors (see the rows below the aforementioned CycleGAN translations in Figure 4).
Compared to the parallel schedule, LiSS keeps relevant features without enforcing continuous training of all the heads, which gives useful freedom to the model. It achieves similar translation performance and better color consistency, though one could argue that the parallel schedule's translations are visually slightly better.
The sequential schedule, on the other hand, seems to have slightly worse translation performance. Some of the useful "knowledge" the two other models still have is no longer available to the translation head: in Figure 4, the smaller zebra is merged with the taller one and the brown patch of ground shows slight stripes.
3.3 Continual Learning Performance
Our main finding is that Lifelong Self-Supervision partially prevents forgetting. We can see in Figures 5 and 6 that our formulation protects the model from forgetting as severe as in sequential training, while providing enough flexibility for it to learn new tasks.
In both datasets, we observe that the naive training schedules behave as expected: the sequential one learns new tasks the fastest, as the model is the least constrained. However, the sequential setup forgets previous tasks almost instantly as it shifts its focus to a new task. On the other hand, the more constrained parallel schedule shows that continuously training on all tasks allows the model to master them all at once. This however comes at a memory and time cost: we could not fit more than 3 samples per batch (vs 5 for the other schedules), and the average processing time per sample is much larger (0.27s against an average of 0.12s for the other schedules). This means that to complete 230k translation steps, the parallel schedule typically takes more than twice as long as LiSS's 24 hours (counting all the pretext tasks).
Figures 5 and 6 show how LiSS maintains accuracies for the Rotation and Jigsaw tasks while performing slightly worse on the Depth prediction and Colorization tasks. As the encoder produces increasingly richer representations, the distillation loss prevents it from mapping input images to regions that would harm previous tasks. Because of our problem's sequential nature, decoding heads do not change after they have achieved the curriculum's required performance, so the burden of producing stable yet informative features relies entirely on the encoders, as the heads cannot adjust to their changes.
Tables 1 and 2 show that it takes more steps for the tasks to be learnt with LiSS. Intuitively, when training sequentially, the encoders are free to adjust exactly to the task. When training with LiSS, they are more constrained and it takes more iterations for them to reach the same performance on pretext tasks. This constraint is however pliable enough for encoders to adjust to new tasks.
| Task | Start step | End step |
| Jigsaw | 8 000 | 158 000 |
| Depth | 158 000 | 170 000 |
| Colorization | 170 000 | 172 000 |
| Task | Start step | End step |
| Jigsaw | 24 000 | 96 000 |
| Depth | 96 000 | 102 000 |
| Colorization | 102 000 | 108 000 |
We propose a method, Lifelong Self-Supervision (LiSS), enabling CycleGAN to leverage sequential self-supervised auxiliary tasks to improve its representations. By distilling the knowledge of a reference encoder (an exponential moving average, in parameter space, of previous encoders), we prevent catastrophic forgetting of the auxiliary tasks, thereby allowing CycleGAN to better disentangle instances of objects to be translated and to rely less on colors. This framework can bring the benefits of training on all the tasks at once at a much lower memory and computational cost, as it only requires keeping one additional encoder. Our exploratory experiments show encouraging results, which will need further investigation in future work to produce a principled framework.
Open questions include the exact impact of the reference encoder's algebra (namely the exponential moving average versus other moving averages, and the impact of the averaging coefficient), and a more thorough hyper-parameter search to tune the loss weights and achieve better pixel-level results. Additionally, exploring other schedules and auxiliary tasks would allow for a better understanding of how SSL can improve unpaired I2IT models. Finally, while CycleGAN's simplicity allowed us to isolate LiSS's contribution to improved translations, exploring its capabilities on more complex architectures is a promising direction for future work.
Appendix A Implementation details
| Task | Start step | End step |
| Jigsaw | 24 000 | 158 000 |
| Depth | 158 000 | 174 000 |
| Colorization | 174 000 | 176 000 |
| Task | Start step | End step |
| Jigsaw | 28 000 | 114 000 |
| Depth | 114 000 | 122 000 |
| Colorization | 122 000 | 124 000 |
Our framework's network architecture follows the baseline CycleGAN [cyclegan] with some differences in the generator to support self-supervision. We use "ResnetBlock" to denote residual blocks [residual_blocks]. "CHW-S-P Conv" denotes a convolutional (or transposed convolutional) layer with C channels, kernel size H×W, stride S and padding P, followed by instance normalization.
Discriminator Network Architecture.
We use PatchGANs [isola2017image_patchgan, li2016precomputed_patchgan, ledig2017photo_patchgan] as in the original CycleGAN [cyclegan] baseline model, shown in Table 3. The discriminator outputs a real or fake label for overlapping patches. The GAN loss function then compares the target label (real or fake) to the average of the patch predictions for the input image.
Encoder Network Architecture.
The encoder network's architecture is inspired from [johnson2016perceptual], as shown in Table 4. The network starts with a reflection padding of size 3 followed by a 7x7 convolution, which avoids severe artifacts around the borders of the generated images, then by 3x3 convolutional blocks with padding 1 and stride 2 to downsample the input image, and finally by 3 residual blocks.
Translation and Colorization Head Architectures.
The translation head's architecture follows the standard CycleGAN generator [cyclegan], as shown in Table 5. It consists of 3 residual blocks followed by upsampling convolutions. For colorization to share the encoder with the other tasks, we repeat grayscale images along the channel dimension.
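The channel-repetition trick can be sketched in one line of NumPy; the function name is illustrative:

```python
import numpy as np

def gray_to_3ch(gray):
    """Repeat a single-channel image along the channel axis so the shared
    encoder sees the same input shape as for the color tasks."""
    return np.repeat(gray, 3, axis=0)  # (1, H, W) -> (3, H, W)

gray = np.random.rand(1, 64, 64)
print(gray_to_3ch(gray).shape)  # (3, 64, 64)
```

This keeps the encoder's first convolution identical across tasks at the cost of storing redundant channels.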
Rotation Network Architecture.
The rotation head's architecture is inspired from [gidaris2018unsupervised] and shown in Table 6. The network performs a simple classification task over the 4 possible rotations (0°, 90°, 180° and 270°).
Jigsaw Network Architecture.
Jigsaw's network predicts the correct order of the shuffled patches of an input image. The network consists of a set of convolutions extracting features from the input image, followed by a fully connected layer mapping them to the possible permutations. The architecture shown in Table 7 performs a classification task over the 64 pre-selected permutations of the shuffled patch order.
Depth Prediction Network Architecture.
The depth network's architecture is inspired from [jiang2017selfsupervised] and shown in Table 8. The network is trained on pseudo-labels predicted by a pre-trained MegaDepth model [li2018megadepth].