We propose a novel self-supervised semi-supervised learning approach for conditional Generative Adversarial Networks (GANs). Unlike previous self-supervised learning approaches, which define pretext tasks by performing augmentations on the image space such as applying geometric transformations or predicting relationships between image patches, our approach leverages the label space. We train our network to learn the distribution of the source domain using the few labelled examples available by uniformly sampling source labels and assigning them as target labels for unlabelled examples from the same distribution. The translated images on the side of the generator are then grouped into positive and negative pairs by comparing their corresponding target labels, which are then used to optimise an auxiliary triplet objective on the discriminator's side. We tested our method on two challenging benchmarks, CelebA and RaFD, and evaluated the results using standard metrics including Fréchet Inception Distance, Inception Score, and Attribute Classification Rate. Extensive empirical evaluation demonstrates the effectiveness of our proposed method over competitive baselines and existing work. In particular, our method is able to surpass the baseline with only 20% of the labelled examples used to train the baseline.
Conditional GANs Odena et al. (2017); Mirza and Osindero (2014); Choi et al. (2018); Isola et al. (2017) provide greater flexibility in generating and manipulating synthetic images. The deployment of such algorithms, however, can be impeded by their need for a large number of annotated training examples. For instance, the size of commonly used labelled datasets for training conditional GANs Choi et al. (2018); Liu et al. (2019); He et al. (2019); Mirza and Osindero (2014) such as CelebA and ImageNet is in the order of $10^5$ to $10^6$ images. We aim to substantially reduce the need for such huge labelled datasets. A promising direction forward is to adopt self-supervised learning approaches, which are being successfully employed in a wide range of computer vision applications including image classification Gidaris et al. (2018); Zhai et al. (2019), semantic segmentation Zhan et al. (2018), robotics Jang et al. (2018), and many more.
Recently, self-supervised learning is also gaining traction with GAN training Lučić et al. (2019); Chen et al. (2019); Tran et al. (2019). Prior work in this area Lučić et al. (2019); Chen et al. (2019) mostly focuses on the input space when designing the pretext task. In particular, Chen et al. (2019) proposed rotating images and minimising an auxiliary rotation angle loss similar to that of Gidaris et al. (2018). Lučić et al. (2019) also adopted a self-supervised objective similar to that of Gidaris et al. (2018) in a semi-supervised setting. In general, existing methods mostly incorporate geometric operations on the input image space as part of the pretext task. In this work, however, we propose to perform augmentations on the output space. Specifically, we utilise the unlabelled examples in a semi-supervised setting to generate a large number of additional labelled examples for the pretext task. Hence, our approach is orthogonal to existing methods.
Our idea is inspired by Nair et al. (2018), which involves setting randomly generated imagined goals for an agent in a reinforcement learning environment. Similar to this work, given the unlabelled examples from the same distribution as the labelled ones, irrespective of their true source attribute labels, conditional GANs should map the source images to similar or near similar regions of the synthetic image manifold if given the same target label and to different regions if given different labels. To satisfy such a constraint, we propose to create a large pool of labelled examples by uniformly sampling labels from the source domain and assigning them to unlabelled data as their target labels. Based on the target labels, we create triplets of synthetic images as additional training examples for our pretext task (as illustrated in Figure 1). Similarly, we also use the limited number of source labels of real images from the labelled pool to create such triplets. These triplets from real examples play an important role in distilling knowledge from the synthetic examples generated from unlabelled data Li et al. (2017) with randomly assigned target labels, as it is crucial to have both source and target labels to faithfully translate images to target attributes Choi et al. (2018); Zhu et al. (2017); Liu et al. (2019). As for the pretext task, we introduce an additional triplet loss as a self-supervised learning objective which is optimised for both the generator and discriminator. This additional objective serves two purposes. First, it alleviates the overfitting problem for the discriminator in a semi-supervised setting as these triplets serve as additional supervision for the network. Second, as a self-supervised method, it can also help address the discriminator forgetting problem discussed by Chen et al. (2019). In addition, unlike the pretext task of Chen et al. (2019), which is geometric in nature, our method focuses on labels and is more in line with our main task of editing attributes, acting as an auxiliary task in a multi-task learning setup Caruana (1997).
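The label-space augmentation described above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the authors' implementation: the helper names `sample_target_labels` and `build_triplets` are hypothetical, labels are simplified to single integers, and each anchor is paired with the first available positive and negative rather than with a sampled one.

```python
import random

def sample_target_labels(num_images, num_classes, rng):
    """Uniformly sample one target label per unlabelled image."""
    return [rng.randrange(num_classes) for _ in range(num_images)]

def build_triplets(labels):
    """Return (anchor, positive, negative) index triplets.

    The positive shares the anchor's target label and the negative has a
    different one. Anchors lacking a positive or a negative are skipped.
    """
    triplets = []
    for a, label in enumerate(labels):
        positives = [i for i, l in enumerate(labels) if l == label and i != a]
        negatives = [i for i, l in enumerate(labels) if l != label]
        if positives and negatives:
            triplets.append((a, positives[0], negatives[0]))
    return triplets

rng = random.Random(0)
# Hypothetical batch: 16 unlabelled images, 4 possible target labels.
target_labels = sample_target_labels(num_images=16, num_classes=4, rng=rng)
triplets = build_triplets(target_labels)
```

Because the triplets depend only on the assigned target labels, the true source labels of the unlabelled images are never needed at this step.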
We evaluated our method on two challenging benchmarks, CelebA and RaFD, which are popular benchmarks for face attribute and expression translations. We take StarGAN Choi et al. (2018) as our baseline conditional GAN, but our method is generic in nature. We compared the results both quantitatively and qualitatively. We used standard metrics FID and Inception score for quantitative comparisons.
Self-supervised learning has been successfully employed to fill the gap between unsupervised and supervised learning frameworks. In particular, numerous self-supervised approaches have been proposed for image classification tasks, including predicting relative positioning of image patches Doersch et al. (2015), generating image content from surroundings Pathak et al. (2016), colouring greyscale images Zhang et al. (2016), counting visual primitives Noroozi et al. (2017), and predicting rotation angles Gidaris et al. (2018). These methods largely focus on geometric transformations on the input image space.
Semi-supervised learning methods become relevant in situations where there is a limited number of labelled examples and a large number of unlabelled ones. One popular approach is to annotate unlabelled data with pseudo-labels and treat the whole dataset as if it were fully labelled Lee (2013). Self-supervised approaches have also been explored in semi-supervised learning settings. For example, Zhai et al. (2019) employed the rotation loss Gidaris et al. (2018) and were able to surpass the performance of fully-supervised methods with a fraction of the labelled examples.
Self-supervised approaches are also becoming more popular with GANs. Chen et al. (2019) proposed to minimise a rotation loss similar to that of Gidaris et al. (2018) on the discriminator. To train conditional GANs in a semi-supervised setting, Lučić et al. (2019) proposed to train a classifier with the limited labelled examples available and employ the suboptimal classifier to annotate the unlabelled data. This method, however, is dependent on the performance of the classifier and cannot be used for cross-domain training, whereas our proposed method is agnostic to the source domain information. Approaches similar to ours have also been successfully used in domain-agnostic policy generation in reinforcement learning Jang et al. (2018); Nair et al. (2018).
Our task is to perform image-to-image translation between different categories in a semi-supervised setting where the majority of training examples are unlabelled and only a small number are labelled. As training a large network in these scenarios could easily lead to overfitting, we aim to mitigate the problem by providing weak supervision using the large number of unlabelled examples available. In short, we propose to utilise the translated images and their corresponding target labels as extra labelled examples for training GANs in a pretext task. The goal of the pretext task is to train both the generator and discriminator to minimise an auxiliary triplet loss by gathering positive and negative pairs of images for classification in a manner akin to metric learning. Compared to optimising a cross-entropy loss across all label categories, this approach is more efficient, easily scalable to large numbers of label categories, and has been successfully adopted by prior work such as one-shot learning Koch et al. (2015) and face recognition Schroff et al. (2015).
We use StarGAN Choi et al. (2018) as the baseline for our experiments, so we give a brief overview of its architecture and loss functions before introducing our method.
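Here, we provide a brief background on conditional GANs taking reference from StarGAN Choi et al. (2018). However, our method is generic in nature and can be applied to any other conditional GAN. StarGAN aims to tackle the problem of multi-domain image-to-image translation without having to train a separate GAN for each pair of domains. It accomplishes this by encoding target domain information as binary or one-hot labels and feeding them along with source images to the generator. It consists of a 12-layer generator $G$, which has an encoder-decoder architecture, and a 6-layer discriminator $D$. To train the network, $G$ is required to translate a source image $x$ conditioned on a random target domain label $c$. The discriminator receives an image and produces an embedding $D_{emb}(x)$, which is then used to produce two outputs $D_{adv}(x)$ and $D_{cls}(x)$. The former, $D_{adv}(x)$, is used to optimise the Wasserstein GAN objective with gradient penalty Gulrajani et al. (2017) defined by

$$\mathcal{L}_{adv} = \mathbb{E}_{x}[D_{adv}(x)] - \mathbb{E}_{x,c}[D_{adv}(G(x,c))] - \lambda_{gp}\,\mathbb{E}_{\hat{x}}\big[(\lVert \nabla_{\hat{x}} D_{adv}(\hat{x}) \rVert_2 - 1)^2\big],$$

where $\hat{x}$ consists of points uniformly sampled from straight lines between the real distribution $p_{data}(x)$ and the synthetic distribution $p_G(G(x,c))$. The latter output $D_{cls}(x)$ is a probability distribution over labels used for optimising a classification loss to help guide $G$ towards generating images that more closely resemble the target domain. The classification losses for $D$ and $G$ are given by

$$\mathcal{L}^{D}_{cls} = \mathbb{E}_{(x,c') \sim p_{data}(x,c')}[-\log D_{cls}(c'|x)], \qquad \mathcal{L}^{G}_{cls} = \mathbb{E}_{x,c}[-\log D_{cls}(c|G(x,c))],$$

where $p_{data}(x,c')$ is the joint distribution between images $x$ and their associated ground truth labels $c'$. In addition, a cycle-consistency loss Zhu et al. (2017), defined as

$$\mathcal{L}_{cyc} = \mathbb{E}_{x,c,c'}\big[\lVert x - G(G(x,c),c') \rVert_1\big],$$

is incorporated to ensure that $G$ preserves content unrelated to the domain translation task. The overall objective for StarGAN is given by

$$\mathcal{L}_{D} = -\mathcal{L}_{adv} + \lambda_{cls}\mathcal{L}^{D}_{cls}, \qquad \mathcal{L}_{G} = \mathcal{L}_{adv} + \lambda_{cls}\mathcal{L}^{G}_{cls} + \lambda_{cyc}\mathcal{L}_{cyc}.$$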
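As a rough illustration of how these pieces fit together, the sketch below shows the interpolation used for the gradient penalty and the way the individual loss terms are combined into the discriminator and generator objectives. It assumes images are flat lists of floats and that the individual loss values have already been computed. The default weights (1 for the classification term, 10 for the cycle-consistency term) are assumptions taken from StarGAN's published defaults, not values stated in this text.

```python
import random

def interpolate(real, fake, rng):
    """Sample a point on the straight line between a real and a synthetic image."""
    eps = rng.random()  # one mixing coefficient per image pair, drawn from U[0, 1)
    return [eps * r + (1.0 - eps) * f for r, f in zip(real, fake)]

def discriminator_objective(l_adv, l_cls_d, lambda_cls=1.0):
    """Discriminator loss: negated adversarial term plus weighted classification term."""
    return -l_adv + lambda_cls * l_cls_d

def generator_objective(l_adv, l_cls_g, l_cyc, lambda_cls=1.0, lambda_cyc=10.0):
    """Generator loss: adversarial, classification and cycle-consistency terms."""
    return l_adv + lambda_cls * l_cls_g + lambda_cyc * l_cyc

# Toy two-pixel "images" to show the interpolation.
x_hat = interpolate([0.0, 1.0], [1.0, 1.0], random.Random(0))
```

The discriminator and generator see the adversarial term with opposite signs, which is what makes the two objectives adversarial.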
Whilst existing self-supervised methods mostly rely on geometric transformations of input images, we take the inspiration of utilising target domain information from the reinforcement learning literature. Jang et al. (2018) learned an embedding of object-centric images by comparing the difference prior to and after an object is grasped. However, this requires source information, which can be scarce in semi-supervised learning settings. Nair et al. (2018) utilised a variational autoencoder Kingma and Welling (2014) to randomly generate a large number of goals to train the agent in a self-supervised manner. Similar to Nair et al. (2018), our self-supervised method involves translating unlabelled images to randomly chosen target domains and using the resulting synthetic images to optimise a triplet matching objective.
Recently, Chen et al. (2019) proposed to minimise a rotation loss on the discriminator to mitigate its forgetting problem due to the continuously changing generator distribution. However, it uses a categorical cross-entropy loss limited to four rotations, which does not scale with the number of categories. Imposing constraints on the discriminator to match the distribution of attributes forces the generator to maintain consistency on translated attributes, ultimately allowing attributes to be retained better on synthetic images. Hence, we propose an auxiliary triplet loss based on label information as a pretext task for both $G$ and $D$. A triplet consists of an anchor example $x_a$, a positive example $x_p$ which shares the same label information as $x_a$, and a negative example $x_n$ which has a different label. We concatenate the discriminator embeddings of the positive pair $(D_{emb}(x_a), D_{emb}(x_p))$ and negative pair $(D_{emb}(x_a), D_{emb}(x_n))$ respectively along the channel axis and feed them through a single convolutional layer, producing probability distributions $D_{mch}(x_a, x_p)$ and $D_{mch}(x_a, x_n)$ respectively over whether each pair has matching labels. Specifically, we propose the following triplet matching objective

$$\mathcal{L}_{mch} = \mathbb{E}_{x_a, x_p, x_n}\big[-\log D_{mch}(x_a, x_p) - \log(1 - D_{mch}(x_a, x_n))\big].$$

Hence, our overall loss function adds the weighted matching term to the StarGAN objectives and is given by

$$\mathcal{L}_{D} = -\mathcal{L}_{adv} + \lambda_{cls}\mathcal{L}^{D}_{cls} + \lambda_{mch}\mathcal{L}_{mch}, \qquad \mathcal{L}_{G} = \mathcal{L}_{adv} + \lambda_{cls}\mathcal{L}^{G}_{cls} + \lambda_{cyc}\mathcal{L}_{cyc} + \lambda_{mch}\mathcal{L}_{mch}.$$
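To make the matching objective concrete, here is a minimal sketch that treats it as a binary cross-entropy over matched and mismatched pairs. That form is an assumption on our part (one natural reading of a head that outputs a matched/mismatched probability). The function name `match_loss` and the clamping constant `eps` are hypothetical.

```python
import math

def match_loss(p_pos, p_neg, eps=1e-12):
    """Average of -log(p_pos) - log(1 - p_neg) over a batch of triplets.

    p_pos[i]: predicted probability that the (anchor, positive) pair matches.
    p_neg[i]: predicted probability that the (anchor, negative) pair matches.
    Probabilities are clamped at eps to avoid taking log(0).
    """
    assert len(p_pos) == len(p_neg) and len(p_pos) > 0
    total = 0.0
    for pp, pn in zip(p_pos, p_neg):
        total += -math.log(max(pp, eps)) - math.log(max(1.0 - pn, eps))
    return total / len(p_pos)

confident = match_loss([0.9, 0.8], [0.1, 0.2])  # pairs mostly classified correctly
unsure = match_loss([0.5, 0.5], [0.5, 0.5])     # chance-level predictions
```

Under this reading the loss is zero only when every positive pair is scored as matched with probability 1 and every negative pair with probability 0, and it grows as predictions drift towards chance.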
We used StarGAN Choi et al. (2018) as a baseline for our experiments and implemented our method directly on top of its architecture. StarGAN unifies multi-domain image-to-image translation with a single generative network and is well suited to our label-based self-supervised approach. However, we would like to re-emphasise that our method is a general idea and can be extended to other conditional GANs.
To avoid potential issues such as non-convergence or mode collapse during training, we used the same hyperparameters as the original StarGAN Choi et al. (2018). Specifically, we trained the network for 200k discriminator iterations with 1 generator update after every 5 discriminator updates. We used the Adam optimiser Kingma and Ba (2015) with $\beta_1 = 0.5$ and $\beta_2 = 0.999$, and the initial learning rates for both generator and discriminator were set to $10^{-4}$ for the first 100k iterations and decayed linearly to 0 over the next 100k iterations. We used a batch size of 16, in which 4 classes were randomly selected with 4 examples per class. Training took approximately 10 hours to complete on an NVIDIA RTX 2080Ti GPU.
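The training schedule above can be sketched as two small helpers. The base learning rate of `1e-4` is an assumption taken from StarGAN's published defaults. The function names are hypothetical and steps are counted in discriminator iterations.

```python
def learning_rate(step, base_lr=1e-4, constant_steps=100_000, decay_steps=100_000):
    """Constant for the first 100k iterations, then linear decay to 0 over the next 100k."""
    if step < constant_steps:
        return base_lr
    remaining = constant_steps + decay_steps - step
    return base_lr * max(remaining, 0) / decay_steps

def is_generator_step(d_step, n_critic=5):
    """True when a generator update follows discriminator update d_step (1-based)."""
    return d_step % n_critic == 0

# Number of generator updates over a full 200k-iteration run.
generator_updates = sum(is_generator_step(s) for s in range(1, 200_001))
```

With one generator update per five discriminator updates, a 200k-iteration run performs 40k generator updates.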
We evaluated our method on two challenging datasets for face attribute and expression manipulation, the CelebFaces Attributes Dataset (CelebA) Liu et al. (2015) and the Radboud Faces Database (RaFD) Langner et al. (2010).
CelebA. CelebA contains 202,599 images of celebrities of size $178 \times 218$ with 40 attribute annotations. In our experiment, we selected 5 attribute annotations including 3 hair colours (black, blond, and brown), gender, and age. The images were cropped to $178 \times 178$ then resized to $128 \times 128$. We followed the official partition of 162,770 examples for training, 19,867 for validation and 19,962 for testing. We created a semi-supervised scenario with limited labelled training examples by uniformly sub-sampling a percentage of training examples as labelled and setting the rest as unlabelled. The sub-sampling process was done to ensure that the examples were spread evenly between classes whenever possible in order to avoid potential problems caused by class imbalance.
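A class-balanced sub-sampling step of this kind might look like the sketch below. It is a simplified illustration rather than the authors' procedure: each example carries a single integer class (CelebA annotations are in fact multi-attribute), and the helper name `split_labelled_pool` is hypothetical.

```python
import random
from collections import defaultdict

def split_labelled_pool(labels, fraction, rng):
    """Pick roughly `fraction` of the indices as labelled, spread evenly across classes.

    Returns (labelled_indices, unlabelled_indices). A class with fewer members
    than its share simply contributes all of its members.
    """
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    budget = round(fraction * len(labels))
    per_class = budget // len(by_class)
    labelled = []
    for label in sorted(by_class):
        members = by_class[label][:]
        rng.shuffle(members)
        labelled.extend(members[:per_class])
    labelled_set = set(labelled)
    unlabelled = [i for i in range(len(labels)) if i not in labelled_set]
    return sorted(labelled), unlabelled

# Toy imbalanced dataset: 70 examples of class 0 and 30 of class 1, 20% labelled.
toy_labels = [0] * 70 + [1] * 30
labelled, unlabelled = split_labelled_pool(toy_labels, 0.2, random.Random(0))
```

Even though the toy dataset is imbalanced, the labelled pool receives the same number of examples from each class.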
RaFD. RaFD is a much smaller dataset with 8,040 images of size $681 \times 1024$ of 67 identities of different genders, races, and ages displaying 8 emotional expressions. The images were cropped to $256 \times 256$ (centred on the face) before being resized to $128 \times 128$. We randomly selected 7 identities comprising 840 images as the test set and the rest (60 identities comprising 7,200 images) as the training set. Similar to CelebA, we created a semi-supervised setting by splitting the training set into labelled and unlabelled pools. We report all results on the test set.
To verify that our method scales to both small and large amounts of annotated data, we tested our approach with various percentages of training examples labelled. Specifically, we performed experiments setting 1%, 5%, 10%, and 20% of CelebA training data as labelled examples, and 10%, 20%, and 50% of RaFD training data, as RaFD is a significantly smaller dataset. Finally, we also evaluated our method on the full datasets to verify its effectiveness on benchmarks designed for supervised learning. We also tested the rotation loss Chen et al. (2019) for comparison. Our baseline was established by setting the match loss weight $\lambda_{mch}$ to 0 whilst leaving all other procedures unchanged. As for our proposed method MatchGAN, the same fixed value of $\lambda_{mch}$ was used for all experiments.
We employed the Fréchet Inception Distance (FID) Heusel et al. (2017) and Inception Score (IS) Salimans et al. (2016) for quantitative evaluations. The FID measures the similarity between the distribution of real examples and that of the generated ones by comparing their respective embeddings of a pretrained Inception-v3 network Szegedy et al. (2016). The Inception Score also measures image quality but relies on the probability outputs of the same Inception-v3 network, taking into account the meaningfulness of individual images and the diversity of all images. If the generated images are of high quality, then FID should be small whereas IS should be large. We computed the FID by translating test images to each of the target attribute domains (5 for CelebA, 8 for RaFD) and comparing the distributions before and after translations. The IS score was computed as an average obtained from a 10-fold split of the test set.
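For reference, the standard definitions of the two metrics can be written as follows. The notation is ours rather than the paper's: $\mu_r, \Sigma_r$ and $\mu_g, \Sigma_g$ denote the mean and covariance of Inception-v3 embeddings of real and generated images, $p(y|x)$ is the Inception-v3 class posterior for a generated image $x$, and $p(y)$ is its marginal over generated images.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)

\mathrm{IS} = \exp\!\left( \mathbb{E}_{x \sim p_g}
  \left[ D_{\mathrm{KL}}\!\left( p(y \mid x) \,\Vert\, p(y) \right) \right] \right)
```

FID vanishes when the two embedding distributions share the same mean and covariance, whereas IS grows when individual predictions are confident and the marginal over classes is diverse.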
In addition to FID and IS, we also used GAN-train and GAN-test Shmelkov et al. (2018) to measure the attribute classification rate of translated images. In short, given a set of real images $X$ with a train-test split $X_{train}$ and $X_{test}$, GAN-train is the accuracy obtained from a classifier trained on synthetic images $G(X_{train})$ and tested on real images $X_{test}$, whereas for GAN-test the classifier is trained on real images $X_{train}$ and tested on synthetic images $G(X_{test})$.
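The two classification rates differ only in which side of the train/test split is synthetic, as the following sketch shows. The nearest-centroid classifier on scalar features is a hypothetical stand-in used purely for illustration. In practice a deep classifier on images would take its place.

```python
def accuracy(predict, examples):
    """Fraction of (features, label) pairs that `predict` classifies correctly."""
    correct = sum(1 for x, y in examples if predict(x) == y)
    return correct / len(examples)

def gan_train(fit, synthetic_train, real_test):
    """Train on synthetic images, test on real images."""
    return accuracy(fit(synthetic_train), real_test)

def gan_test(fit, real_train, synthetic_test):
    """Train on real images, test on synthetic images."""
    return accuracy(fit(real_train), synthetic_test)

def fit_nearest_centroid(examples):
    """Toy stand-in classifier on scalar features (hypothetical, for illustration)."""
    sums, counts = {}, {}
    for x, y in examples:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    centroids = {y: sums[y] / counts[y] for y in sums}
    return lambda x: min(centroids, key=lambda y: abs(x - centroids[y]))

# Toy data: scalar "images" paired with binary labels.
real = [(0.0, 0), (0.2, 0), (1.0, 1), (0.8, 1)]
synthetic = [(0.1, 0), (0.9, 1), (0.6, 0)]
gan_train_acc = gan_train(fit_nearest_centroid, synthetic[:2], real)
gan_test_acc = gan_test(fit_nearest_centroid, real, synthetic)
```

A high GAN-train score suggests that the synthetic images are useful as training data for real-image classification, while a high GAN-test score suggests that the synthetic images carry attributes a real-image classifier can recognise.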
Our proposed method involves extracting triplets from labelled real examples and all synthetic examples (labelled and unlabelled). To show that our method does not simply rely on the few labelled examples and that both unlabelled and synthetic examples are necessary to achieve good performance, we trained our network in three separate scenarios by removing various amounts of real and synthetic data used for updating the match loss (shown in Table 1). Three observations can be made from this table. First, the difference between “Baseline” and “Full” shows that our proposed method achieves substantial improvements over the baseline. Second, the match loss can indeed benefit from a large number of unlabelled examples, as seen by comparing “Labelled only” and “Full”. Third, synthetic examples can be utilised by the match loss to achieve a further performance increase, which is evident from the comparison between “Real only” and “Full”. As a result, the “Full” setup will be used for all following experiments involving the match loss.
|Setup||Baseline||Labelled only||Real only||Full|
|No. of real (labelled) examples for updating the match loss||0||2.5k||2.5k||2.5k|
|Total no. of examples used for training||162k||2.5k||162k||162k|
We evaluated the performance of our proposed method, the baseline, and rotation loss Chen et al. (2019) using FID and IS and the results are shown in Table 2. In terms of FID, our method consistently outperformed the baseline in both CelebA and RaFD. For CelebA in particular, with just 20% of training examples labelled, our method was able to achieve better performance than the baseline with 100% of the training examples labelled. Our method also has a distinct lead over the baseline when there are very few labelled examples. In addition, our method was also able to match or outperform rotation loss in both datasets, again with a distinct advantage over rotation loss when labelled examples are limited.
In terms of IS, we still managed to outperform both the baseline and rotation loss in the majority of the setups. In other setups our method was either on par with the baseline or slightly underperforming within a margin of 0.02. We would like to emphasise that IS is less consistent than FID as it does not compare the synthetic distribution with an “ideal” one. In addition, IS is computed using the 1000-dimensional output of Inception-v3 pretrained on ImageNet which is arguably less than suitable for human face datasets such as CelebA and RaFD. However, we included IS here as it is still one of the most widely used metrics for evaluating the performance of GANs.
|Dataset||Metric||Setup||1% labelled||5% labelled||10% labelled||20% labelled||50% labelled||100% labelled|
|CelebA||FID||Baseline Choi et al. (2018)||17.04||10.54||9.47||7.07||n/a||6.65|
|CelebA||FID||Rotation Chen et al. (2019)||17.08||10.00||8.04||6.82||n/a||5.91|
|CelebA||IS||Baseline Choi et al. (2018)||2.86||2.95||3.00||3.01||n/a||3.01|
|CelebA||IS||Rotation Chen et al. (2019)||2.82||2.99||2.96||3.01||n/a||3.06|
|RaFD||FID||Baseline Choi et al. (2018)||n/a||n/a||32.015||11.75||7.24||5.14|
|RaFD||FID||Rotation Chen et al. (2019)||n/a||n/a||28.88||10.96||6.57||5.00|
|RaFD||IS||Baseline Choi et al. (2018)||n/a||n/a||1.66||1.60||1.58||1.56|
|RaFD||IS||Rotation Chen et al. (2019)||n/a||n/a||1.62||1.58||1.58||1.60|
In terms of GAN-train and GAN-test classification rates, our method outperformed the baseline in both CelebA and RaFD (shown in Table 3) under the 100% setup, which has the best FID overall. In both cases, GAN-train in particular was significantly higher than GAN-test, which indicates that the synthetic examples generated using our method can be effectively used to augment small datasets for training classifiers. We report the results under the 100% setup as it has the lowest FID, and FID is considered one of the most robust metrics for evaluating the performance of GANs. We expect GAN-train and GAN-test in other setups to follow the same trend as their respective FIDs.
|Dataset||Setup||GAN-train||GAN-test|
|CelebA||Baseline Choi et al. (2018)||87.29%||81.11%|
|RaFD||Baseline Choi et al. (2018)||95.00%||75.00%|
Figure 3 compares the visual quality of the images generated by Baseline and MatchGAN. Our method was able to produce images that are less noisy and more coherent. For instance, in the 1.67% setup, the Baseline can be observed to generate artefacts and blurry patches which are not present in the images generated by MatchGAN. The image quality of our method also improves substantially with more labelled examples. In particular, the overall quality of the images generated with our method in the 20% setup was on par with or even better than that of the Baseline under the 100% setup in terms of clarity, colour tone, and coherence of target attributes, corroborating our quantitative results shown in Table 2.
In this paper we proposed MatchGAN, a novel self-supervised learning approach for training conditional GANs under a semi-supervised setting with very few labelled examples. MatchGAN utilises synthetic examples and their target labels as additional annotated examples and minimises a triplet matching objective as a pretext task. With 20% of the training data labelled, it is able to outperform the baseline trained with 100% of examples labelled and shows a distinct advantage over other self-supervised approaches such as Chen et al. (2019) under both fully-supervised and semi-supervised settings.
Data-based machine learning and deep learning approaches have become extremely successful in the past decade and have been used in a wide range of applications including computer vision, image recognition, natural language processing, object detection, cancer image classification, astronomical image processing and many more. A significant drawback of these approaches, however, is that training these models to reach a competent level usually requires an enormous amount of annotated data which can be expensive or impractical to obtain in certain areas. Even if there is no shortage of data, annotating them can be extremely laborious and sometimes requires specific disciplinary knowledge.
Our work seeks to alleviate this problem by incorporating labelled synthetic data into the training process, which can drastically reduce the amount of annotated data required to achieve satisfactory performance. This allows deep learning to be more accessible and can substantially benefit researchers from disciplines where data or annotations are difficult to obtain. Whilst a potential disadvantage is that realistic synthetic images of human faces could be used to fool deep neural networks used for security purposes, one should not ignore the more important positive side that they can be used as adversarial examples to help these networks better defend against such attacks. Our method is also a general idea for conditional GANs, and training classifiers on top of our synthetic data for other applications is beyond the scope of our research. In the case that our method fails, it would only lead to bad-quality examples being generated, which researchers could simply discard. Overall, our idea is generic in nature and can be applied to any conditional GAN as a means to reduce its reliance on annotated data.
This work is partially supported by Huawei CBG (Consumer Business Group) and EPSRC Programme Grant ‘FACER2VM’ (EP/N007743/1).
Isola et al. (2017). Image-to-image translation with conditional adversarial networks. In CVPR.
Zhang et al. (2016). Colorful image colorization. In ECCV.
In this work, we use the exact same generator and discriminator architecture for MatchGAN as StarGAN, and it is fully convolutional. The generator consists of 3 downsampling convolutional layers, 6 bottleneck residual blocks, and 3 upsampling convolutional layers. Each downsampling or upsampling layer halves or doubles the height and width of the input. Instance normalisation and ReLU activation are used for all layers except the last layer, where the Tanh function is used instead. As for the discriminator, it consists of 6 downsampling convolutional layers with leaky ReLU activations with a slope of 0.01 for negative values. Let $x$ be an input image of size $h \times w$. The discriminator output $D_{emb}(x)$ has 2048 channels and spatial size $\frac{h}{64} \times \frac{w}{64}$, and is then fed through two additional convolutional layers $D_{adv}$ and $D_{cls}$ to produce outputs for the adversarial loss and classification loss respectively. $D_{adv}$ performs a convolution that reduces the number of input channels from 2048 to 1 whilst maintaining the height and width. This output is then averaged across the spatial dimensions to arrive at a scalar value. $D_{cls}$ performs a convolution using a kernel with the same spatial size as $D_{emb}(x)$, so its output is of size $1 \times 1 \times n$ where $n$ is the number of classes (e.g. 8 for RaFD). This is then fed through either a Softmax layer to produce a probability distribution (for RaFD) or a Sigmoid layer for multi-label binary classification (for CelebA). With an input image size of $128 \times 128$, StarGAN has 53.22M learned parameters in total, comprising 8.43M from the generator and 44.79M from the discriminator.
MatchGAN. To produce an output for the match loss, we create an additional layer $D_{mch}$ after the discriminator output as illustrated in Figure 4. Specifically, a triplet of images $(x_a, x_p, x_n)$ is passed through the discriminator to produce embeddings $D_{emb}(x_a)$, $D_{emb}(x_p)$, and $D_{emb}(x_n)$. The positive pair $(D_{emb}(x_a), D_{emb}(x_p))$ and negative pair $(D_{emb}(x_a), D_{emb}(x_n))$ are concatenated separately along the channel dimension to produce 4096-channel embeddings. Afterwards, they are convolved to produce outputs of size $1 \times 1 \times 2$, which are then fed through a Softmax layer to produce probability distributions over whether each image pair is matched or mismatched. For input images of size $128 \times 128$, this layer adds approximately 32.77K to the total number of learned parameters, which is negligible compared to the 53.22M parameters in the StarGAN baseline, and thus has very little impact on training efficiency.
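The quoted parameter count for the match layer can be checked with a short calculation. The sketch below rests on three assumptions that are inferred rather than stated: the input is $128 \times 128$, the convolution kernel covers the whole spatial extent of the embedding, and the layer has no bias term. The helper name `conv_params` is hypothetical.

```python
def conv_params(in_channels, out_channels, kernel_h, kernel_w, bias=False):
    """Number of learned parameters in a 2-D convolutional layer."""
    weights = in_channels * out_channels * kernel_h * kernel_w
    return weights + (out_channels if bias else 0)

image_size = 128            # assumed input resolution
num_downsampling = 6        # each discriminator layer halves height and width
embedding_channels = 2048   # channels of the discriminator embedding

spatial = image_size // 2 ** num_downsampling   # spatial size of the embedding
pair_channels = 2 * embedding_channels          # two embeddings concatenated
# Two output channels: matched versus mismatched.
match_head_params = conv_params(pair_channels, 2, spatial, spatial)
```

Under these assumptions the layer has 32,768 weights, which agrees with the figure of approximately 32.77K and is roughly 0.06% of the 53.22M parameters in the baseline.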