Code for the paper "DisCont: Self-Supervised Visual Attribute Disentanglement using Context Vectors".
Disentangling the underlying feature attributes within an image with no prior supervision is a challenging task. Models that can disentangle attributes well provide greater interpretability and control. In this paper, we propose a self-supervised framework DisCont to disentangle multiple attributes by exploiting the structural inductive biases within images. Motivated by the recent surge in contrastive learning paradigms, our model bridges the gap between self-supervised contrastive learning algorithms and unsupervised disentanglement. We evaluate the efficacy of our approach, both qualitatively and quantitatively, on four benchmark datasets.
Real-world data like images are generated from several independent and interpretable underlying attributes (Bengio, 2013). It has generally been assumed that successfully disentangling these attributes can lead to robust task-agnostic representations which can enhance efficiency and performance of deep models (Schölkopf et al., 2012; Bengio et al., 2013; Peters et al., 2017). However, recovering these independent factors in a completely unsupervised manner has proved to be a major challenge.
Recent approaches to unsupervised disentanglement have largely used variants of variational autoencoders (Higgins et al., 2017; Kim and Mnih, 2018; Chen et al., 2018; Kim et al., 2019) and generative adversarial networks (Chen et al., 2016; Hu et al., 2018; Shukla et al., 2019). Further, such disentangled representations have been utilized for a diverse range of applications including domain adaptation (Cao et al., 2018; Vu and Huang, 2019; Yang et al., 2019), video frame prediction (Denton and Birodkar, 2017; Villegas et al., 2017; Hsieh et al., 2018; Bhagat et al., 2020), recommendation systems (Ma et al., 2019) and multi-task learning (Meng et al., 2019).
In contrast to these approaches, Locatello et al. (2018) introduced an 'impossibility result' which showed that unsupervised disentanglement is impossible without explicit inductive biases on the models and data used. They proved, both empirically and theoretically, that without leveraging the implicit structure induced by these inductive biases within various datasets, disentangled representations cannot be learnt in an unsupervised fashion.
Inspired by this result, we explore methods to exploit the spatial and structural inductive biases prevalent in most visual datasets (Cohen and Shashua, 2016; Ghosh and Gupta, 2019). Recent literature on visual self-supervised representation learning (Misra and van der Maaten, 2019; Tian et al., 2019; He et al., 2019; Arora et al., 2019; Chen et al., 2020) has shown that methodically grounded data augmentation techniques within contrastive paradigms (Gutmann and Hyvärinen, 2010; van den Oord et al., 2018; Hénaff et al., 2019) are a promising way to leverage such inductive biases present in images. The success of these contrastive learning approaches in diverse tasks like reinforcement learning (Kipf et al., 2019; Srinivas et al., 2020), multi-modal representation learning (Patrick et al., 2020; Udandarao et al., 2020) and information retrieval (Shi et al., 2019; Le and Akoglu, 2019) further motivates us to apply them to the problem of unsupervised disentangled representation learning.
In this work, we present an intuitive self-supervised framework DisCont to disentangle multiple feature attributes from images by utilising meaningful data augmentation recipes. We hypothesize that applying various stochastic transformations to an image can be used to recover the underlying feature attributes. Consider the example of data possessing two underlying attributes, i.e., color and position. If we apply a color transformation (e.g., color jittering, gray-scale transform) to such an image, only the underlying color attribute should change while the position attribute is preserved. Similarly, on applying a translation and/or rotation to the image, the position attribute should vary while keeping the color attribute intact.
It is known that there are several intrinsic variations present within different independent attributes (Farhadi et al., 2009; Zhang et al., 2019). To aptly capture these variations, we introduce ‘Attribute Context Vectors’ (refer Section 2.2.2). We posit that by constructing attribute-specific context vectors that learn to capture the entire variability within that attribute, we can learn richer and more robust representations.
Our major contributions in this work can be summarised as follows:
We propose a self-supervised method DisCont to simultaneously disentangle multiple underlying visual attributes by effectively introducing inductive biases in images via data augmentations.
We highlight the utility of leveraging composite stochastic transformations for learning richer disentangled representations.
We present the idea of ‘Attribute Context Vectors’ to capture and utilize intra-attribute variations in an extensive manner.
We impose an attribute clustering objective that is commonly used in distance metric learning literature, and show that it further promotes attribute disentanglement.
The rest of the paper is organized as follows: Section 2 presents our proposed self-supervised attribute disentanglement framework, Section 3 provides empirical verification for our hypotheses using qualitative and quantitative evaluations, and Section 4 concludes the paper and provides directions for future research.
In this section, we start off by introducing the notations we follow, move on to describing the network architecture and the loss functions employed, and finally, illustrate the training procedure and optimization strategy adopted.
Assume we have a dataset of images, each associated with a set of labeled attributes. These images can be thought of as being generated by a small set of explicable feature attributes. For example, consider the CelebA dataset (Liu et al., 2015) containing face images; a few of its underlying attributes are hair color, eyeglasses, bangs, moustache, etc.
Following (Do and Tran, 2020), we define a latent representation chunk as 'fully disentangled' w.r.t. a ground-truth factor if it is fully separable from the remaining chunks and fully interpretable w.r.t. that factor. For such a representation, the following conditions hold:
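The equation display for these conditions did not survive extraction; one plausible formalization, with $z_i$ denoting the $i$-th latent chunk and $y_i$ its ground-truth factor (notation introduced here for illustration), is:

```latex
I(z_i \,;\, z_{\neq i}) = 0 \quad \text{(full separability)},
\qquad
H(z_i \mid y_i) = 0 \quad \text{(full interpretability)}
```

where $I(\cdot\,;\,\cdot)$ denotes mutual information and $H(\cdot \mid \cdot)$ conditional entropy.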
where the first quantity denotes the mutual information between two latent chunks, while the second denotes the conditional entropy of a latent chunk given its attribute. To recover these feature attributes in a self-supervised manner while ensuring attribute disentanglement, we propose an encoder-decoder network (refer Fig 1) that makes use of contrastive learning paradigms.
To enforce the learning of rich and disentangled attributes, we propose to view the underlying latent space as two disjoint subspaces.
The first is the feature attribute space, which contains the disentangled and interpretable attributes and is characterized by its dimensionality and the number of feature attributes; the second is the unspecified attribute space, which captures the remaining, attribute-agnostic variation within an image.
Assume that we have an invertible encoding function, parameterized by a set of learnable weights; each image can then be encoded in the following way:
where we can index the resulting encoding to recover the independent feature attributes. To project the latent encodings back to image space, we make use of a decoding function, likewise parameterized by learnable weights. We can therefore obtain image reconstructions from the separate latent encodings.
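As a concrete sketch of this chunk indexing, the following splits a latent code into per-attribute chunks plus an unspecified remainder (the contiguous layout and chunk sizes here are illustrative assumptions, not taken from the paper):

```python
import numpy as np

def split_latent(z, num_attributes, chunk_dim):
    """Split a latent code z into per-attribute chunks f_1..f_k and an
    unspecified remainder chunk (hypothetical layout for illustration)."""
    chunks = [z[i * chunk_dim:(i + 1) * chunk_dim] for i in range(num_attributes)]
    unspecified = z[num_attributes * chunk_dim:]
    return chunks, unspecified

z = np.arange(32.0)  # a 32-dimensional latent code, as in Table 3
chunks, z_u = split_latent(z, num_attributes=2, chunk_dim=8)
```

Each attribute chunk can then be fed independently to the decoder or to the context-vector projection.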
Following our initial hypothesis of recovering latent attributes using stochastic transformations, we formulate a mask-based compositional augmentation approach that leverages positive and negative transformations.
Assume that we have two sets of stochastic transformations, a positive set and a negative set, each of which can augment an image into a correlated view. The positive set contains transformations that, when applied to an image, should not change any of the underlying attributes, whereas the negative set contains transformations that should change a single underlying attribute, i.e., when the k-th negative transformation is applied to an image, it should lead to a change only in the k-th attribute, and all other attributes should be preserved.
Taking inspiration from (van den Oord et al., 2018), we propose attribute context vectors. A context vector is formed from each of the individual feature attributes through a non-linear projection. The idea is to encapsulate the batch-invariant identity and variability of an attribute in its context vector. Hence, each individual context vector should capture an independent, disentangled feature space of the individual factors of variation. Assume a non-linear mapping function that projects each attribute into the context-vector space; we then construct context vectors by aggregating all the feature attributes locally within a sampled mini-batch.
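A minimal NumPy sketch of this construction, assuming a single tanh layer for the non-linear map and mean-aggregation over the mini-batch (both are assumptions, not details from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def context_vector(attr_chunks, W, b):
    """Form one attribute's context vector: project each chunk in the
    mini-batch through a non-linear map (a single tanh layer with assumed
    weights W, b) and aggregate by averaging over the batch."""
    projected = np.tanh(attr_chunks @ W + b)  # (batch_size, context_dim)
    return projected.mean(axis=0)             # local aggregation over the mini-batch

batch_chunks = rng.standard_normal((64, 8))   # one attribute's chunks, batch of 64
W = 0.1 * rng.standard_normal((8, 100))       # context-vector dimension 100, as in Table 3
b = np.zeros(100)
c = context_vector(batch_chunks, W, b)
```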
We describe the losses that we use to enforce disentanglement and interpretability within our feature attribute space.
We include a reconstruction penalty term to ensure that the feature attributes and the unspecified attribute encode enough information to produce high-fidelity image reconstructions.
To ensure that the unspecified attribute acts as a generative latent code that encodes the arbitrary features within an image, we enforce the ELBO KL objective (Kingma and Welling, 2013) on it.
where the two densities are those of the continuous distributions between which the KL divergence is computed, i.e., the approximate posterior and the prior, respectively.
We additionally enforce clustering in the feature attribute space using a center loss, which differentiates inter-attribute features. This metric-learning training strategy (Wen et al., 2016) promotes the accumulation of feature attributes into distantly placed clusters by providing additional self-supervision in the form of pseudo-labels obtained from the context vectors. The center loss increases inter-attribute distances while diminishing intra-attribute distances. We make use of the non-linear mapping function to project the feature attributes into the context-vector space and then apply the center loss given by Equation 5.
where the context vectors function as the centers of the clusters corresponding to each attribute.
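A minimal NumPy sketch of this center loss, with the context vectors standing in as cluster centers (the pseudo-label assignment and the 1/2 scaling are assumptions):

```python
import numpy as np

def center_loss(features, labels, centers):
    """Center loss (Wen et al., 2016): half the mean squared distance
    between each projected feature and the center of its pseudo-labelled
    cluster; here the context vectors play the role of the centers."""
    diffs = features - centers[labels]
    return 0.5 * float(np.mean(np.sum(diffs ** 2, axis=1)))

centers = np.array([[0.0, 0.0], [4.0, 4.0]])   # toy 2-D "context vectors"
feats = np.array([[1.0, 0.0], [4.0, 3.0]])     # projected feature attributes
labels = np.array([0, 1])                      # pseudo-labels from the context vectors
loss = center_loss(feats, labels, centers)
```

Minimizing this pulls each projected feature toward its attribute's context vector, shrinking intra-attribute spread.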
We also ensure that the context vectors do not deviate significantly across training steps by imposing a gradient descent update on the context vectors themselves.
Finally, to ensure augmentation-specific consistency within the feature attributes, we propose a feature-wise regularization penalty. We first generate the augmented batch and its mask using Algorithm 1. We then encode the augmented batch to produce augmented feature attributes and an augmented unspecified attribute in the following way:
Now, since we want to ensure that a specific negative augmentation enforces a change in only its target feature attribute, we encourage the representation of every other feature attribute to stay close to its augmented counterpart in representation space. This enforces each feature attribute to be invariant to the augmentation perturbations that are not meant to affect it. Further, since the background attributes of images should be preserved irrespective of the augmentations applied, we also encourage the proximity of the original and augmented unspecified attributes. This augmentation-consistency loss is defined as:
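Under one plausible reading of this loss, a NumPy sketch (the masking and weighting details are assumptions):

```python
import numpy as np

def aug_consistency_loss(f, f_hat, z_u, z_u_hat, changed):
    """Sketch of the augmentation-consistency penalty: squared distance
    between original and augmented attribute chunks for every attribute
    EXCEPT the one the negative transformation intentionally changed
    (given per-sample by `changed`), plus a term keeping the unspecified
    chunks close. Exact masking and weighting are assumed."""
    per_attr = np.sum((f - f_hat) ** 2, axis=-1)      # (batch, k)
    keep = np.ones_like(per_attr)
    keep[np.arange(len(changed)), changed] = 0.0      # drop the perturbed attribute
    attr_term = float(np.sum(per_attr * keep) / np.sum(keep))
    bg_term = float(np.mean(np.sum((z_u - z_u_hat) ** 2, axis=-1)))
    return attr_term + bg_term
```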
The overall loss of our model is a weighted sum of these constituent losses.
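As a minimal sketch, the weighted combination might look like the following (the KL and augmentation-consistency weights follow Table 3; the reconstruction and center-loss weights of 1 are assumptions):

```python
def total_loss(l_rec, l_kl, l_center, l_aug,
               w_kl=1.0, w_aug=0.2, w_center=1.0):
    """Weighted sum of the constituent DisCont losses. w_kl=1 and
    w_aug=0.2 come from Table 3; w_center=1 is an assumed default."""
    return l_rec + w_kl * l_kl + w_center * l_center + w_aug * l_aug
```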
Informativeness. To ensure a robust evaluation of our disentanglement models using unsupervised metrics, we compare informativeness scores (defined in (Do and Tran, 2020)) of our model's latent chunks in Fig 2 with the state-of-the-art unsupervised disentanglement model presented in (Hu et al., 2018) (which we refer to as MIX). A lower informativeness score suggests a better disentangled latent representation. Further details about the evaluation metric can be found in Appendix E.
Latent Visualization. We present latent visualisations for the test set samples with and without the unspecified chunk. The separation between these latent chunks in the projected space manifests the decorrelation and independence between them. Here, we present latent visualisations for the dSprites dataset; those for the other datasets can be found in Appendix G.
Attribute Transfer. We present attribute transfer visualizations to assess the quality of disentanglement. The images in the first two rows of each grid are randomly sampled from the test set. The bottom-row images are formed by swapping one specified attribute from the top-row image with the corresponding attribute chunk of the second-row image, keeping all other attributes fixed. This allows us to gauge the purity of attribute-wise information captured by each latent chunk. We present these results for Cars3D and 3DShapes here; the rest are in Appendix F.
In this paper, we propose a self-supervised attribute disentanglement framework DisCont which leverages specific data augmentations to exploit the spatial inductive biases present in images. We also propose ‘Attribute Context Vectors’ that encapsulate the intra-attribute variations. Our results show that such a framework can be readily used to recover semantically meaningful attributes independently.
We would like to thank Dr. Saket Anand (IIIT Delhi) for his guidance in formulating the initial problem statement, and valuable comments and feedback on this paper.
Investigating convolutional neural networks using spatial orderness. In Proceedings of the IEEE International Conference on Computer Vision Workshops.
Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 297–304.
Self-supervised learning of pretext-invariant representations. arXiv preprint arXiv:1912.01991.
DeepChannel: salience estimation by contrastive learning for extractive document summarization. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 6999–7006.
A discriminative feature learning approach for deep face recognition. In European Conference on Computer Vision, pp. 499–515.
This section describes the generation of the mask and the augmented batch for the computation of the augmentation-consistency loss (refer Equation 6). The entire algorithm is detailed below:
Input: a batch of images, the set of positive transformations, the set of negative transformations, and the number of feature attributes. Output: the augmented batch and the mask. In a first pass over the batch, transformations from the positive set are sampled and applied to each image; these preserve all underlying attributes. In a second pass, a negative transformation is sampled for each image, applied, and the index of the single attribute it perturbs is recorded in the mask. The two passes together yield the augmented batch and its mask.
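The procedure above can be sketched in Python as follows (a hypothetical reconstruction; the sampling probabilities and exact control flow are assumptions, and the toy transformations stand in for real image augmentations):

```python
import random

def augment_batch(batch, positive_tfms, negative_tfms, p_pos=0.5):
    """Sketch of Algorithm 1: each image optionally receives
    attribute-preserving positive transformations, then exactly one
    negative transformation t_k that perturbs attribute k; the mask
    records which k was changed for each image."""
    augmented, mask = [], []
    for x in batch:
        for t in positive_tfms:
            if random.random() < p_pos:   # positive transforms preserve all attributes
                x = t(x)
        k = random.randrange(len(negative_tfms))
        augmented.append(negative_tfms[k](x))
        mask.append(k)                    # attribute intentionally changed
    return augmented, mask

random.seed(0)
aug, mask = augment_batch([1, 2, 3],
                          positive_tfms=[lambda v: v + 10],
                          negative_tfms=[lambda v: -v, lambda v: v * 2])
```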
The set of augmentations and their sampling range of parameters that we use in this work are detailed in the table below:
| Augmentation | Parameter | Sampling Range |
|---|---|---|
| Gaussian Noise |  | [0.5, 1, 2, 5] |
| Gaussian Smoothing |  | [0.1, 0.2, 0.5, 1] |
| Rotation |  | [90°, 180°, 270°] |
| Crop & Resize | – | – |
| Cutout | Length | [5, 10, 15, 20] |
We use the following datasets to evaluate our model performance:
Sprites (Reed et al., 2015) is a dataset of 480 unique animated caricatures (sprites) with essentially 6 factors of variations namely gender, hair type, body type, armor type, arm type and greaves type. The entire dataset consists of 143,040 images with 320, 80 and 80 characters in train, test and validation sets respectively.
Cars3D (Reed et al., 2015) consists of a total of 17,568 images of synthetic cars with elevation, azimuth and object type varying in each image.
3DShapes (Burgess and Kim, 2018) is a dataset of 3D shapes generated using 6 independent factors of variations namely floor color, wall color, object color, scale, shape and orientation. It consists of a total of 480,000 images of 4 discrete 3D shapes.
dSprites (Matthey et al., 2017) is a dataset consisting of 2D square, ellipse, and hearts with varying color, scale, rotation, x and y positions. In total, it consists of 737,280 gray-scale images.
We use a single experimental setup across all our experiments. We implement all our models on an Nvidia GTX 1080 GPU using the PyTorch framework. Architectural details are provided in Table 2. The training hyperparameters used are listed in Table 3.
| Encoder | Decoder |
|---|---|
| Conv, 64, ELU, stride 2, BN | FC, 1024, ReLU, BN |
| Conv, 128, ELU, stride 2, BN | FC, 4096, ReLU |
| Conv, 256, ELU, stride 2, BN | FC, 1024, 4068, ReLU, BN |
| Conv, 512, ELU, stride 2, BN | FC, 4096, ReLU |
| FC, 4608, 1024, ELU, BN | Deconv, 256, ReLU, stride 2, BN |
| FC, 1024, ELU, BN | Deconv, 128, ReLU, stride 2, BN |
|  | Deconv, 64, ReLU, stride 2, BN |
|  | Deconv, 3, ReLU, stride 2, BN |
| Hyperparameter | Value |
|---|---|
| Batch Size | 64 |
| Latent Space Dimension | 32 |
| Number of Feature Attributes | 2 |
| Context Vector Dimension | 100 |
| KL-Divergence Weight | 1 |
| Augmentation-Consistency Loss Weight | 0.2 |
| Learning Rate | 1e-4 |
As detailed in (Do and Tran, 2020), disentangled representations need to have low mutual information with the base data distribution, since ideally each representation should capture at most one attribute within the data. The informativeness of a representation w.r.t. the data is determined by computing the mutual information using the following equation:
where the encoding function maps each sample in the dataset to its latent representation. The informativeness metric helps us capture the amount of information encapsulated within each latent chunk with respect to the original image. We compare our DisCont model with the unsupervised disentanglement model proposed by (Hu et al., 2018). For training the model of (Hu et al., 2018), we use the following hyperparameter values.
| Hyperparameter |  |  |  |  |
|---|---|---|---|---|
| Latent Space Dimension | 256 | 96 | 512 | 96 |
| Number of Chunks | 8 | 3 | 8 | 3 |
| Dimension of Each Chunk | 32 | 32 | 64 | 32 |
| Learning Rate | 2e-4 | 2e-4 | 5e-5 | 2e-4 |
We present attribute transfer visualizations for validating our disentanglement performance. The first two rows depict the sampled batch of images from the test set, while the bottom row depicts the images generated by swapping the specified attribute from the first-row images with that of the second-row images. The style transfer results for Sprites are shown in Fig 9. The feature-swapping results for the dSprites dataset were not consistent, probably because of the ambiguity induced by the color transformation in the feature attribute space when applied to single-channel images.
In this section, we present additional latent visualisations of the test set samples with and without the unspecified chunk. The latent visualizations for the Cars3D, Sprites and 3DShapes datasets are shown in Fig 13.
In the future, we would like to explore research directions that involve generalizing the set of augmentations used. Further, as suggested in (Locatello et al., 2018), we would like to evaluate the performance gains from leveraging our disentanglement model in terms of sample complexity on various downstream tasks.