DisCont: Self-Supervised Visual Attribute Disentanglement using Context Vectors

06/10/2020 · by Sarthak Bhagat, et al.

Disentangling the underlying feature attributes within an image with no prior supervision is a challenging task. Models that can disentangle attributes well provide greater interpretability and control. In this paper, we propose a self-supervised framework DisCont to disentangle multiple attributes by exploiting the structural inductive biases within images. Motivated by the recent surge in contrastive learning paradigms, our model bridges the gap between self-supervised contrastive learning algorithms and unsupervised disentanglement. We evaluate the efficacy of our approach, both qualitatively and quantitatively, on four benchmark datasets.

1 Introduction

Real-world data like images are generated from several independent and interpretable underlying attributes (Bengio, 2013). It has generally been assumed that successfully disentangling these attributes leads to robust, task-agnostic representations that can enhance the efficiency and performance of deep models (Schölkopf et al., 2012; Bengio et al., 2013; Peters et al., 2017). However, recovering these independent factors in a completely unsupervised manner has remained a major challenge.

Recent approaches to unsupervised disentanglement have largely relied on variants of variational autoencoders (Higgins et al., 2017; Kim and Mnih, 2018; Chen et al., 2018; Kim et al., 2019) and generative adversarial networks (Chen et al., 2016; Hu et al., 2018; Shukla et al., 2019). Such disentangled representations have been utilized for a diverse range of applications including domain adaptation (Cao et al., 2018; Vu and Huang, 2019; Yang et al., 2019), video frame prediction (Denton and Birodkar, 2017; Villegas et al., 2017; Hsieh et al., 2018; Bhagat et al., 2020), recommendation systems (Ma et al., 2019) and multi-task learning (Meng et al., 2019).

In contrast to these approaches, (Locatello et al., 2018) introduced an 'impossibility result' which showed that unsupervised disentanglement is impossible without explicit inductive biases on the models and the data. They showed, both empirically and theoretically, that without leveraging the implicit structure induced by these inductive biases within various datasets, disentangled representations cannot be learnt in an unsupervised fashion.

Inspired by this result, we explore methods to exploit the spatial and structural inductive biases prevalent in most visual datasets (Cohen and Shashua, 2016; Ghosh and Gupta, 2019). Recent literature on visual self-supervised representation learning (Misra and van der Maaten, 2019; Tian et al., 2019; He et al., 2019; Arora et al., 2019; Chen et al., 2020) has shown that methodically grounded data augmentation, combined with contrastive learning paradigms (Gutmann and Hyvärinen, 2010; van den Oord et al., 2018; Hénaff et al., 2019), is a promising direction for leveraging the inductive biases present in images. The success of these contrastive learning approaches in diverse tasks such as reinforcement learning (Kipf et al., 2019; Srinivas et al., 2020), multi-modal representation learning (Patrick et al., 2020; Udandarao et al., 2020) and information retrieval (Shi et al., 2019; Le and Akoglu, 2019) further motivates us to apply them to the problem of unsupervised disentangled representation learning.

In this work, we present an intuitive self-supervised framework, DisCont, to disentangle multiple feature attributes from images by utilising meaningful data augmentation recipes. We hypothesize that applying various stochastic transformations to an image can be used to recover the underlying feature attributes. Consider the example of data possessing two underlying attributes, i.e., color and position. If we apply a color transformation (e.g., color jittering or a grayscale transform) to such an image, only the underlying color attribute should change while the position attribute is preserved. Similarly, on applying a translation and/or rotation to the image, the position attribute should vary while the color attribute remains intact.

It is known that there are several intrinsic variations present within different independent attributes (Farhadi et al., 2009; Zhang et al., 2019). To aptly capture these variations, we introduce 'Attribute Context Vectors' (see Section 2.2.2). We posit that by constructing attribute-specific context vectors that learn to capture the entire variability within an attribute, we can learn richer and more robust representations.

Our major contributions in this work can be summarised as follows:

  • We propose a self-supervised method DisCont to simultaneously disentangle multiple underlying visual attributes by effectively introducing inductive biases in images via data augmentations.

  • We highlight the utility of leveraging composite stochastic transformations for learning richer disentangled representations.

  • We present the idea of ‘Attribute Context Vectors’ to capture and utilize intra-attribute variations in an extensive manner.

  • We impose an attribute clustering objective that is commonly used in distance metric learning literature, and show that it further promotes attribute disentanglement.

The rest of the paper is organized as follows: Section 2 presents our proposed self-supervised attribute disentanglement framework, Section 3 provides empirical verification for our hypotheses using qualitative and quantitative evaluations, and Section 4 concludes the paper and provides directions for future research.

2 Methodology

In this section, we start off by introducing the notations we follow, move on to describing the network architecture and the loss functions employed, and finally, illustrate the training procedure and optimization strategy adopted.

2.1 Preliminaries

Assume we have a dataset $\mathcal{D} = \{x_i\}_{i=1}^{N}$ containing $N$ images, where each image $x_i$ is associated with a set of underlying attributes. These images can be thought of as being generated by a small set of explicable feature attributes. For example, consider the CelebA dataset (Liu et al., 2015) containing face images. A few of the underlying attributes are hair color, eyeglasses, bangs, moustache, etc.

From (Do and Tran, 2020), we define a latent representation chunk $z_i$ as 'fully disentangled' w.r.t. a ground truth factor $y_i$ if $z_i$ is fully separable from the remaining chunks $z_{j \neq i}$ and is fully interpretable w.r.t. $y_i$. Therefore, we can say that for such a representation, the following conditions hold:

$I(z_i, z_j) = 0 \quad \forall\, j \neq i$   (1)

and

$H(z_i \mid y_i) = 0$   (2)

where $I(z_i, z_j)$ denotes the mutual information between two latent chunks while $H(z_i \mid y_i)$ denotes the entropy of the latent chunk $z_i$ w.r.t. the attribute $y_i$. To recover these feature attributes in a self-supervised manner while ensuring attribute disentanglement, we propose an encoder-decoder network (see Fig 1) that makes use of contrastive learning paradigms.

Figure 1: Overview of our architecture DisCont. Given a batch of images $X$, we generate an augmented batch $\hat{X}$ by sampling a set of stochastic transformations. We then encode $X$ and $\hat{X}$ to extract their latent representations $(F, z_u)$ and $(\hat{F}, \hat{z}_u)$, respectively. $F$ is then used to construct 'Attribute Context Vectors' corresponding to each feature attribute. The context vectors and the latent representations are then used to optimize our disentanglement objective.

2.2 Model Description

To enforce the learning of rich and disentangled attributes, we propose to view the underlying latent space as two disjoint subspaces.

  • $\mathcal{F} \subseteq \mathbb{R}^{d_f \times k}$: denotes the feature attribute space containing the disentangled and interpretable attributes, where $d_f$ and $k$ denote the dimensionality of the space and the number of feature attributes respectively.

  • $\mathcal{Z}_u \subseteq \mathbb{R}^{d_u}$: denotes the unspecified attribute space containing background and miscellaneous attributes, where $d_u$ is the dimensionality of the space. We enforce a prior over this space following (Mathieu et al., 2016) and (Jha et al., 2018).

Assume that we have an invertible encoding function $E_\phi$ parameterized by $\phi$; then each image $x$ can be encoded in the following way:

$(F, z_u) = E_\phi(x), \quad F \in \mathcal{F}, \; z_u \in \mathcal{Z}_u$

where we can index $F$ to recover the independent feature attributes, i.e., $F = \{f_1, f_2, \dots, f_k\}$. To project the latent encodings back to image space, we make use of a decoding function $D_\theta$ parameterized by $\theta$. Therefore, we can obtain image reconstructions $D_\theta(F, z_u)$ using the separate latent encodings.
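To make this factorisation concrete, the following is a minimal PyTorch sketch of an encoder that splits its output into $k$ feature attribute chunks and an unspecified chunk. The layer sizes, class names and dimensionalities are illustrative assumptions and do not reproduce the exact architecture given in Appendix D.

```python
# Minimal sketch of the encoder split E_phi(x) -> (f_1, ..., f_k, z_u).
# All sizes below are assumptions for illustration only.
import torch
import torch.nn as nn

NUM_ATTRS, ATTR_DIM, UNSPEC_DIM = 2, 8, 16  # k, per-attribute dim, unspecified dim

class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.ELU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ELU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.fc = nn.Linear(128, NUM_ATTRS * ATTR_DIM + UNSPEC_DIM)

    def forward(self, x):
        h = self.fc(self.backbone(x))
        f, z_u = h[:, :NUM_ATTRS * ATTR_DIM], h[:, NUM_ATTRS * ATTR_DIM:]
        return f.view(-1, NUM_ATTRS, ATTR_DIM), z_u  # f[:, j] is attribute chunk f_j

f, z_u = Encoder()(torch.randn(4, 3, 64, 64))  # f: (4, 2, 8), z_u: (4, 16)
```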

2.2.1 Composite Data Augmentations

Following our initial hypothesis of recovering latent attributes using stochastic transformations, we formulate a mask-based compositional augmentation approach that leverages positive and negative transformations.

Assume that we have two sets of stochastic transformations $\mathcal{T}_p$ and $\mathcal{T}_n$ that can augment an image $x$ into a correlated form $\hat{x} = t(x)$. $\mathcal{T}_p$ denotes the positive set of transformations that, when applied to an image, should not change any of the underlying attributes, whereas $\mathcal{T}_n = \{t_1, \dots, t_k\}$ denotes the negative set of transformations that, when applied to an image, should change a single underlying attribute, i.e., when $t_j$ is applied to an image $x$, it should lead to a change only in the attribute $f_j$ while all other attributes are preserved.

For every batch of images $X$, we sample a subset of transformations from $\mathcal{T}_p \cup \mathcal{T}_n$ to apply compositionally to $X$, and retrieve an augmented batch $\hat{X}$ and a mask vector $m \in \{0, 1\}^k$ recording which negative transformations were applied. This is further detailed in Appendix A and Appendix B.

2.2.2 Attribute Context Vectors

Taking inspiration from (van den Oord et al., 2018), we propose attribute context vectors $\{c_1, \dots, c_k\}$. A context vector $c_j$ is formed from the individual feature attribute $f_j$ through a non-linear projection. The idea is to encapsulate the batch-invariant identity and variability of the $j$-th attribute in $c_j$. Hence, each individual context vector should capture an independent disentangled feature space of the individual factors of variation. Assume a non-linear mapping function $g: \mathbb{R}^{B \times d_f} \rightarrow \mathbb{R}^{d_c}$, where $d_c$ denotes the dimensionality of each context vector and $B$ denotes the size of a sampled mini-batch. We construct context vectors by aggregating all the feature attributes locally within the sampled mini-batch.
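As a sketch of how such an aggregation might look, the snippet below projects each attribute chunk with a small MLP and averages over the mini-batch to obtain one context vector per attribute. The two-layer architecture and the dimensions are assumptions, not the exact context network of Appendix D.

```python
# Sketch of a context network g: per-attribute chunks -> one context vector c_j.
import torch
import torch.nn as nn

class ContextNetwork(nn.Module):
    def __init__(self, attr_dim=8, ctx_dim=100):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(attr_dim, ctx_dim), nn.ReLU(),
                                  nn.Linear(ctx_dim, ctx_dim))

    def forward(self, f):                   # f: (B, k, attr_dim)
        # project each chunk, then aggregate over the mini-batch dimension
        return self.proj(f).mean(dim=0)     # (k, ctx_dim): one context vector per attribute
```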

2.3 Loss Functions

We describe the losses that we use to enforce disentanglement and interpretability within our feature attribute space.
We have a reconstruction penalty term to ensure that $F$ and $z_u$ encode enough information to produce high-fidelity image reconstructions:

$\mathcal{L}_{rec} = \big\| x - D_\theta(F, z_u) \big\|_2^2$   (3)

To ensure that the unspecified attribute $z_u$ acts as a generative latent code that encodes the arbitrary features within an image, we enforce the ELBO KL objective (Kingma and Welling, 2013) on $z_u$:

$\mathcal{L}_{KL} = D_{KL}\big(q(z_u \mid x) \,\|\, p(z_u)\big) = \int q(z_u \mid x) \, \log \frac{q(z_u \mid x)}{p(z_u)} \, dz_u$   (4)

where $q$ and $p$ are the densities of arbitrary continuous distributions $Q$ and $P$ respectively.
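For the common special case where the posterior over $z_u$ is a diagonal Gaussian and the prior is a standard normal, an assumption we make here following standard VAE practice rather than a stated detail of the paper, the KL term has the familiar closed form:

```python
import torch

def kl_divergence(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    # D_KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent
    # dimensions and averaged over the batch.
    return (-0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=1)).mean()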

We additionally enforce clustering in the feature attribute space using a center loss, which separates features belonging to different attributes. This metric learning training strategy (Wen et al., 2016) promotes the accumulation of feature attributes into distantly-placed clusters by providing additional self-supervision in the form of pseudo-labels obtained from the context vectors.

The center loss increases inter-attribute distances while diminishing intra-attribute distances. We make use of the function $g$ to project the feature attributes into the context vector space and then apply the center loss given by Equation 5:

$\mathcal{L}_{cen} = \frac{1}{2} \sum_{i=1}^{B} \sum_{j=1}^{k} \big\| g(f_j^{(i)}) - c_j \big\|_2^2$   (5)

where the context vectors $c_j$ function as centers for the clusters corresponding to the $j$-th attribute.

We also ensure that the context vectors do not deviate largely across training iterations by updating the context vectors themselves via gradient descent.
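A minimal sketch of this clustering term, assuming the projected attribute chunks and the context vectors share the shapes used in the earlier sketches (the final averaging over the batch is a minor normalisation choice of ours):

```python
import torch

def center_loss(proj_attrs: torch.Tensor, contexts: torch.Tensor) -> torch.Tensor:
    """proj_attrs: (B, k, ctx_dim) feature attributes projected by g;
    contexts: (k, ctx_dim) attribute context vectors acting as cluster centers."""
    # pull each projected attribute towards its attribute's context vector
    return 0.5 * (proj_attrs - contexts.unsqueeze(0)).pow(2).sum(dim=-1).mean()
```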

Finally, to ensure augmentation-specific consistency within the feature attributes, we propose a feature-wise regularization penalty $\mathcal{L}_{aug}$. We first generate the augmented batch $\hat{X}$ and mask $m$ using Algorithm 1. We then encode $\hat{X}$ to produce the augmented feature attributes and unspecified attribute in the following way:

$(\hat{F}, \hat{z}_u) = E_\phi(\hat{X}), \quad \hat{F} = \{\hat{f}_1, \dots, \hat{f}_k\}$

Now, since we want to ensure that a specific negative augmentation $t_j$ enforces a change in only the feature attribute $f_j$, we encourage the representations of every other attribute $f_i$ and its augmented counterpart $\hat{f}_i$, $i \neq j$, to be close in representation space. This enforces each feature attribute to be invariant to the augmentation perturbations that do not target it. Further, since the background attributes of images should be preserved irrespective of the augmentations applied, we also encourage the proximity of $z_u$ and $\hat{z}_u$. This augmentation-consistency loss is defined as:

$\mathcal{L}_{aug} = \sum_{j=1}^{k} (1 - m_j)\, \big\| f_j - \hat{f}_j \big\|_2^2 + \big\| z_u - \hat{z}_u \big\|_2^2$   (6)
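The sketch below implements this penalty under our reading that the mask entry $m_j$ is 1 when the negative transformation targeting attribute $j$ was applied, so that only the unperturbed attributes are constrained:

```python
import torch

def augmentation_consistency(f, f_aug, z_u, z_u_aug, mask):
    """f, f_aug: (B, k, attr_dim); z_u, z_u_aug: (B, unspec_dim); mask: (k,)."""
    keep = (1.0 - mask).view(1, -1, 1)                   # drop attributes that were perturbed
    attr_term = (keep * (f - f_aug).pow(2)).sum(-1).sum(-1).mean()
    unspec_term = (z_u - z_u_aug).pow(2).sum(-1).mean()  # background must be preserved
    return attr_term + unspec_term
```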

The overall loss of our model is a weighted sum of these constituent losses:

$\mathcal{L} = \mathcal{L}_{rec} + \lambda_{KL}\,\mathcal{L}_{KL} + \mathcal{L}_{cen} + \lambda_{aug}\,\mathcal{L}_{aug}$   (7)

where the weights $\lambda_{KL}$ and $\lambda_{aug}$ are treated as hyperparameters. Implementation details can be found in Appendix D.
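Putting the pieces together, the overall objective can be assembled as below. The $\lambda$ defaults follow Table 3; leaving the reconstruction and clustering terms unweighted is our reading of Equation 7, not a confirmed implementation detail.

```python
def total_loss(l_rec, l_kl, l_cen, l_aug, lambda_kl=1.0, lambda_aug=0.2):
    # Weighted sum of the constituent losses (Equation 7).
    return l_rec + lambda_kl * l_kl + l_cen + lambda_aug * l_aug
```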

3 Experiments

We employ a diverse set of four datasets to evaluate the efficacy of our approach, the details of which can be found in Appendix C. Code is available at https://github.com/sarthak268/DisCont.

3.1 Quantitative Results

Informativeness. To ensure a robust evaluation of our disentanglement models using unsupervised metrics, we compare the informativeness scores (defined in (Do and Tran, 2020)) of our model's latent chunks in Fig 2 against the state-of-the-art unsupervised disentanglement model presented in (Hu et al., 2018), which we refer to as MIX. A lower informativeness score suggests a better disentangled latent representation. Further details about the evaluation metric can be found in Appendix E.

Figure 2: Informativeness scores for DisCont and MIX across datasets

3.2 Qualitative Results

Latent Visualization. We present latent visualisations for the test set samples with and without the unspecified chunk $z_u$. The separation between these latent chunks in the projected space demonstrates the decorrelation and independence between them. Here, we present latent visualisations for the dSprites dataset; those for the other datasets can be found in Appendix G. A sketch of the plotting procedure is given after Figure 3.

Figure 3: Latent space visualizations for the dSprites dataset with all latent chunks (Left), i.e., $\{f_1, \dots, f_k, z_u\}$, and only the specified feature chunks (Right), i.e., $\{f_1, \dots, f_k\}$. Each color depicts samples from a different latent chunk, with purple representing the unspecified chunk $z_u$.
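Such plots can be produced by projecting every latent chunk of the test set to two dimensions and colouring points by the chunk they originate from; a hedged sketch using t-SNE (our choice of projection, not necessarily the one used in the paper) is given below.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_chunks(chunks, labels):
    """chunks: list of arrays, each (num_samples, chunk_dim); labels: chunk names."""
    # zero-pad chunks to a common dimensionality before stacking (assumption)
    dim = max(c.shape[1] for c in chunks)
    padded = [np.pad(c, ((0, 0), (0, dim - c.shape[1]))) for c in chunks]
    points = TSNE(n_components=2).fit_transform(np.vstack(padded))
    start = 0
    for c, name in zip(padded, labels):
        plt.scatter(*points[start:start + len(c)].T, s=5, label=name)
        start += len(c)
    plt.legend()
    plt.show()
```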

Attribute Transfer. We present attribute transfer visualizations to assess the quality of disentanglement. The images in the first two rows of each grid are randomly sampled from the test set. The bottom row images are formed by swapping one specified attribute chunk of the top row image with the corresponding attribute chunk of the second row image, keeping all other attributes fixed. This allows us to gauge the purity of the attribute-wise information captured by each latent chunk. We present these results for Cars3D and 3DShapes here, and the rest in Appendix F; a sketch of the swapping procedure is given after Figure 6.

(a) Cars3D; Specified Attribute: Color
(b) 3DShapes; Specified Attribute: Orientation
Figure 6: Attribute transfer results obtained by swapping the specified chunk.
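The swapping procedure itself is straightforward given the encoder/decoder interface sketched in Section 2.2; the snippet below assumes that interface and is not the authors' exact code.

```python
import torch

@torch.no_grad()
def swap_attribute(encoder, decoder, x_a, x_b, attr_idx):
    # encode both images, transfer one specified attribute chunk from x_b to x_a,
    # and decode while keeping every other chunk (and z_u) of x_a fixed
    f_a, z_a = encoder(x_a)
    f_b, _ = encoder(x_b)
    f_a[:, attr_idx] = f_b[:, attr_idx]
    return decoder(f_a, z_a)
```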

4 Conclusion

In this paper, we propose a self-supervised attribute disentanglement framework DisCont which leverages specific data augmentations to exploit the spatial inductive biases present in images. We also propose ‘Attribute Context Vectors’ that encapsulate the intra-attribute variations. Our results show that such a framework can be readily used to recover semantically meaningful attributes independently.

Acknowledgement

We would like to thank Dr. Saket Anand (IIIT Delhi) for his guidance in formulating the initial problem statement, and valuable comments and feedback on this paper.

References

  • S. Arora, H. Khandeparkar, M. Khodak, O. Plevrakis, and N. Saunshi (2019) A theoretical analysis of contrastive unsupervised representation learning. arXiv preprint arXiv:1902.09229. Cited by: §1.
  • Y. Bengio, A. Courville, and P. Vincent (2013) Representation learning: a review and new perspectives. IEEE transactions on pattern analysis and machine intelligence 35 (8), pp. 1798–1828. Cited by: §1.
  • Y. Bengio (2013) Deep learning of representations: looking forward. In International Conference on Statistical Language and Speech Processing, pp. 1–37. Cited by: §1.
  • S. Bhagat, S. Uppal, V. T. Yin, and N. Lim (2020) Disentangling representations using gaussian processes in variational autoencoders for video prediction. ArXiv abs/2001.02408. Cited by: §1.
  • C. Burgess and H. Kim (2018) 3D shapes dataset. Note: https://github.com/deepmind/3dshapes-dataset/ Cited by: 3rd item.
  • J. Cao, O. Katzir, P. Jiang, D. Lischinski, D. Cohen-Or, C. Tu, and Y. Li (2018) DiDA: disentangled synthesis for domain adaptation. ArXiv abs/1805.08019. Cited by: §1.
  • T. Q. Chen, X. Li, R. B. Grosse, and D. Duvenaud (2018) Isolating sources of disentanglement in variational autoencoders. In NeurIPS, Cited by: §1.
  • T. Chen, S. Kornblith, M. Norouzi, and G. Hinton (2020) A simple framework for contrastive learning of visual representations. External Links: 2002.05709 Cited by: §1.
  • X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel (2016) Infogan: interpretable representation learning by information maximizing generative adversarial nets. In Advances in neural information processing systems, pp. 2172–2180. Cited by: §1.
  • N. Cohen and A. Shashua (2016) Inductive bias of deep convolutional networks through pooling geometry. arXiv preprint arXiv:1605.06743. Cited by: §1.
  • E. L. Denton and V. Birodkar (2017) Unsupervised learning of disentangled representations from video. In NIPS, Cited by: §1.
  • K. Do and T. Tran (2020) Theory and evaluation metrics for learning disentangled representations. ArXiv abs/1908.09961. Cited by: Appendix E, §2.1, §3.1.
  • A. Farhadi, I. Endres, D. Hoiem, and D. Forsyth (2009) Describing objects by their attributes. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1778–1785. Cited by: §1.
  • R. Ghosh and A. K. Gupta (2019) Investigating convolutional neural networks using spatial orderness. In Proceedings of the IEEE International Conference on Computer Vision Workshops. Cited by: §1.
  • M. Gutmann and A. Hyvärinen (2010) Noise-contrastive estimation: a new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 297–304. Cited by: §1.
  • K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick (2019) Momentum contrast for unsupervised visual representation learning. arXiv preprint arXiv:1911.05722. Cited by: §1.
  • O. J. Hénaff, A. Srinivas, J. D. Fauw, A. Razavi, C. Doersch, S. M. A. Eslami, and A. van den Oord (2019) Data-efficient image recognition with contrastive predictive coding. External Links: 1905.09272 Cited by: §1.
  • I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. M. Botvinick, S. Mohamed, and A. Lerchner (2017) Beta-vae: learning basic visual concepts with a constrained variational framework. In ICLR, Cited by: §1.
  • J. Hsieh, B. Liu, D. Huang, L. Fei-Fei, and J. C. Niebles (2018) Learning to decompose and disentangle representations for video prediction. In NeurIPS, Cited by: §1.
  • Q. Hu, A. Szabó, T. Portenier, M. Zwicker, and P. Favaro (2018) Disentangling factors of variation by mixing them. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3399–3407. Cited by: Table 4, Appendix E, §1, §3.1.
  • A. H. Jha, S. Anand, M. K. Singh, and V. S. R. Veeravasarapu (2018) Disentangling factors of variation with cycle-consistent variational auto-encoders. ArXiv abs/1804.10469. Cited by: 2nd item.
  • H. Kim and A. Mnih (2018) Disentangling by factorising. ArXiv abs/1802.05983. Cited by: §1.
  • M. Kim, Y. Wang, P. Sahu, and V. Pavlovic (2019) Relevance factor vae: learning and identifying disentangled factors. ArXiv abs/1902.01568. Cited by: §1.
  • D. P. Kingma and M. Welling (2013) Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114. Cited by: §2.3.
  • T. Kipf, E. van der Pol, and M. Welling (2019) Contrastive learning of structured world models. arXiv preprint arXiv:1911.12247. Cited by: §1.
  • T. Le and L. Akoglu (2019) ContraVis: contrastive and visual topic modeling for comparing document collections. In The World Wide Web Conference, pp. 928–938. Cited by: §1.
  • Z. Liu, P. Luo, X. Wang, and X. Tang (2015) Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), Cited by: §2.1.
  • F. Locatello, S. Bauer, M. Lucic, G. Rätsch, S. Gelly, B. Schölkopf, and O. Bachem (2018) Challenging common assumptions in the unsupervised learning of disentangled representations. arXiv preprint arXiv:1811.12359. Cited by: Appendix H, §1.
  • J. Ma, C. Zhou, P. Cui, H. Yang, and W. Zhu (2019) Learning disentangled representations for recommendation. In NeurIPS, Cited by: §1.
  • M. Mathieu, J. J. Zhao, P. Sprechmann, A. Ramesh, and Y. LeCun (2016) Disentangling factors of variation in deep representation using adversarial training. ArXiv abs/1611.03383. Cited by: 2nd item.
  • L. Matthey, I. Higgins, D. Hassabis, and A. Lerchner (2017) DSprites: disentanglement testing sprites dataset. Note: https://github.com/deepmind/dsprites-dataset/ Cited by: 4th item.
  • Q. Meng, N. Pawlowski, D. Rueckert, and B. Kainz (2019) Representation disentanglement for multi-task learning with application to fetal ultrasound. In SUSI/PIPPI@MICCAI, Cited by: §1.
  • I. Misra and L. van der Maaten (2019) Self-supervised learning of pretext-invariant representations. arXiv preprint arXiv:1912.01991. Cited by: §1.
  • M. Patrick, Y. M. Asano, R. Fong, J. F. Henriques, G. Zweig, and A. Vedaldi (2020) Multi-modal self-supervision from generalized data transformations. arXiv preprint arXiv:2003.04298. Cited by: §1.
  • J. Peters, D. Janzing, and B. Schölkopf (2017) Elements of causal inference: foundations and learning algorithms. MIT press. Cited by: §1.
  • S. E. Reed, Y. Zhang, Y. Zhang, and H. Lee (2015) Deep visual analogy-making. In NIPS, Cited by: 1st item, 2nd item.
  • B. Schölkopf, D. Janzing, J. Peters, E. Sgouritsa, K. Zhang, and J. Mooij (2012) On causal and anticausal learning. arXiv preprint arXiv:1206.6471. Cited by: §1.
  • J. Shi, C. Liang, L. Hou, J. Li, Z. Liu, and H. Zhang (2019) Deepchannel: salience estimation by contrastive learning for extractive document summarization. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 6999–7006. Cited by: §1.
  • A. Shukla, S. Bhagat, S. Uppal, S. Anand, and P. K. Turaga (2019) Product of orthogonal spheres parameterization for disentangled representation learning. In BMVC, Cited by: §1.
  • A. Srinivas, M. Laskin, and P. Abbeel (2020) Curl: contrastive unsupervised representations for reinforcement learning. arXiv preprint arXiv:2004.04136. Cited by: §1.
  • Y. Tian, D. Krishnan, and P. Isola (2019) Contrastive multiview coding. External Links: 1906.05849 Cited by: §1.
  • V. Udandarao, A. Maiti, D. Srivatsav, S. R. Vyalla, Y. Yin, and R. R. Shah (2020) COBRA: contrastive bi-modal representation algorithm. arXiv preprint arXiv:2005.03687. Cited by: §1.
  • A. van den Oord, Y. Li, and O. Vinyals (2018) Representation learning with contrastive predictive coding. External Links: 1807.03748 Cited by: §1, §2.2.2.
  • R. Villegas, J. Yang, S. Hong, X. Lin, and H. Lee (2017) Decomposing motion and content for natural video sequence prediction. ArXiv abs/1706.08033. Cited by: §1.
  • H. T. Vu and C. Huang (2019) Domain adaptation meets disentangled representation learning and style transfer. 2019 IEEE International Conference on Systems, Man and Cybernetics (SMC), pp. 2998–3005. Cited by: §1.
  • Y. Wen, K. Zhang, Z. Li, and Y. Qiao (2016) A discriminative feature learning approach for deep face recognition. In European conference on computer vision, pp. 499–515. Cited by: §2.3.
  • J. Yang, N. C. Dvornek, F. Zhang, J. Chapiro, M. Lin, and J. S. Duncan (2019) Unsupervised domain adaptation via disentangled representations: application to cross-modality liver segmentation. MICCAI 11765, pp. 255–263. Cited by: §1.
  • J. Zhang, Y. Huang, Y. Li, W. Zhao, and L. Zhang (2019) Multi-attribute transfer via disentangled representation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 9195–9202. Cited by: §1.

Appendix A Mask and Augmented Batch Generation Algorithm

This section describes the generation of the mask and the augmented batch used to compute the augmentation-consistency loss (see Equation 6). The entire algorithm is detailed below, followed by a PyTorch sketch:

Algorithm 1 Mask and Augmented Batch Generation
  Input: A batch of images $X$, the set of positive transformations $\mathcal{T}_p$, the set of negative transformations $\mathcal{T}_n = \{t_1, \dots, t_k\}$, number of feature attributes $k$
  Output: The augmented batch $\hat{X}$, the mask $m$
  Initialize $\hat{X} \leftarrow X$, $m \leftarrow \mathbf{0}_k$
  for each $t \in \mathcal{T}_p$ do
    if $t$ is sampled then
      $\hat{X} \leftarrow t(\hat{X})$
    end if
  end for
  for $j = 1$ to $k$ do
    if $t_j$ is sampled then
      $\hat{X} \leftarrow t_j(\hat{X})$
      $m_j \leftarrow 1$
    end if
  end for
  return $\hat{X}$, $m$
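A minimal PyTorch realisation of Algorithm 1 is sketched below. The Bernoulli sampling probability and the transform interface (callables acting on a batch tensor) are assumptions rather than details taken from the reference implementation.

```python
import random
import torch

def generate_augmented_batch(x, positive_tfms, negative_tfms, p_apply=0.5):
    """x: (B, C, H, W); negative_tfms: list of k callables, one per feature attribute."""
    x_aug = x.clone()
    mask = torch.zeros(len(negative_tfms))
    for t in positive_tfms:                  # positive transforms leave all attributes intact
        if random.random() < p_apply:
            x_aug = t(x_aug)
    for j, t in enumerate(negative_tfms):    # negative transform t_j targets attribute j
        if random.random() < p_apply:
            x_aug = t(x_aug)
            mask[j] = 1.0                    # record which attributes were perturbed
    return x_aug, mask
```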

Appendix B Augmentations

The set of augmentations and the sampling ranges of their parameters used in this work are detailed in the table below; a possible torchvision instantiation follows Table 1.

Positive Augmentations
Type | Parameter Range
Gaussian Noise | [0.5, 1, 2, 5]
Gaussian Smoothing | [0.1, 0.2, 0.5, 1]
Negative Augmentations
Type | Parameter Range
Grayscale Transform |
Flipping Orientation | [Horizontal, Vertical]
Rotation | [90°, 180°, 270°]
Crop & Resize |
Cutout Length | [5, 10, 15, 20]
Table 1: Set of augmentations used for training DisCont.
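One possible instantiation of Table 1 with torchvision transforms is sketched below; the resulting lists can be passed to the `generate_augmented_batch` sketch in Appendix A. The image size, the pixel scale assumed for the noise parameters, and the use of RandomErasing as a cutout substitute are our assumptions.

```python
import random
import torch
from torchvision import transforms as T

def gaussian_noise(x):
    sigma = random.choice([0.5, 1, 2, 5])   # parameter values from Table 1; pixel scale assumed
    return x + sigma * torch.randn_like(x)

positive_tfms = [
    gaussian_noise,
    T.RandomChoice([T.GaussianBlur(5, sigma=(s, s)) for s in (0.1, 0.2, 0.5, 1)]),
]

negative_tfms = [
    T.Grayscale(num_output_channels=3),                                            # grayscale transform
    T.RandomChoice([T.RandomHorizontalFlip(p=1.0), T.RandomVerticalFlip(p=1.0)]),  # flipping
    T.RandomChoice([T.RandomRotation((d, d)) for d in (90, 180, 270)]),            # rotation
    T.RandomResizedCrop(64),                                                       # crop & resize (size assumed)
    T.RandomErasing(p=1.0, scale=(0.01, 0.1)),                                     # cutout-style occlusion
]
```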

Appendix C Dataset Description

We use the following datasets to evaluate our model performance:

  • Sprites (Reed et al., 2015) is a dataset of 480 unique animated caricatures (sprites) with 6 factors of variation, namely gender, hair type, body type, armor type, arm type and greaves type. The entire dataset consists of 143,040 images, with 320, 80 and 80 characters in the train, test and validation sets respectively.

  • Cars3D (Reed et al., 2015) consists of a total of 17,568 images of synthetic cars with elevation, azimuth and object type varying in each image.

  • 3DShapes (Burgess and Kim, 2018) is a dataset of 3D shapes generated using 6 independent factors of variation, namely floor color, wall color, object color, scale, shape and orientation. It consists of a total of 480,000 images of 4 discrete 3D shapes.

  • dSprites (Matthey et al., 2017) is a dataset consisting of 2D squares, ellipses and hearts with varying color, scale, rotation, and x and y positions. In total, it consists of 737,280 gray-scale images.

Appendix D Implementation, Training and Hyperparameter Details

We use a single experimental setup across all our experiments. We implement all our models on an Nvidia GTX 1080 GPU using the PyTorch framework. Architectural details are provided in Table 2. The training hyperparameters used are listed in Table 3.

Encoder | Decoder | Context Network
Input: | Input: | Input:
Conv, 64, ELU, stride 2, BN | FC, 1024, ReLU, BN | FC, 4096, ReLU
Conv, 128, ELU, stride 2, BN | FC, 1024, 4068, ReLU, BN | FC, 4096, ReLU
Conv, 256, ELU, stride 2, BN | Deconv, 256, ReLU, stride 2, BN |
Conv, 512, ELU, stride 2, BN | Deconv, 128, ReLU, stride 2, BN |
FC, 4608, 1024, ELU, BN | Deconv, 64, ReLU, stride 2, BN |
FC, 1024, ELU, BN | Deconv, 3, ReLU, stride 2, BN |
Table 2: Architectures for the encoder $E_\phi$, decoder $D_\theta$ and context network $g$ for all experiments. Here, Conv denotes a 2D convolution layer, Deconv a 2D transposed convolution layer, FC a fully connected layer, and BN a batch normalisation layer.
Parameter Value
Batch Size ($B$) 64
Latent Space Dimension 32
Number of Feature Attributes ($k$) 2
Context Vector Dimension ($d_c$) 100
KL-Divergence Weight ($\lambda_{KL}$) 1
Augmentation-Consistency Loss Weight ($\lambda_{aug}$) 0.2
Optimizer Adam
Learning Rate 1e-4
Adam $\beta_1$ 0.5
Adam $\beta_2$ 0.999
Training Epochs 250
Table 3: Training hyperparameters for DisCont common across all datasets.

Appendix E Informativeness Evaluation Metric

As detailed in (Do and Tran, 2020), disentangled representations need to have low mutual information with the base data distribution, since ideally each representation should capture at most one attribute within the data. The informativeness of a representation $z_i$ w.r.t. the data $x$ is determined by computing the mutual information using the following equation:

$I(z_i, x) = \int p(z_i, x) \, \log \frac{p(z_i, x)}{p(z_i)\, p(x)} \, dz_i \, dx$   (8)

where $z_i$ is a latent chunk produced by the encoding function $E_\phi$, i.e., $(F, z_u) = E_\phi(x)$, and $x$ is drawn from the dataset $\mathcal{D}$ of all samples. The informativeness metric helps us capture the amount of information encapsulated within each latent chunk with respect to the original image $x$. We compare our DisCont model with the unsupervised disentanglement model proposed by (Hu et al., 2018). For training the model in (Hu et al., 2018), we use the following hyperparameter values.

Hyperparameter Sprites Cars3D 3DShapes dSprites
Latent Space Dimension 256 96 512 96
Number of Chunks 8 3 8 3
Dimension of Each Chunk 32 32 64 32
Optimizer Adam Adam Adam Adam
Learning Rate () 2e-4 2e-4 5e-5 2e-4
Adam $\beta_1$ 0.5 0.5 0.5 0.5
Adam $\beta_2$ 0.99 0.99 0.99 0.99
Training Epochs 100 200 200 150
Table 4: Training hyperparameters for the model in (Hu et al., 2018)

Appendix F Attribute Transfer

We present attribute transfer visualizations to validate our disentanglement performance. The first two rows depict the sampled batch of images from the test set, while the bottom row depicts the images generated by swapping the specified attribute of the first row images with that of the second row images. The attribute transfer results for Sprites are shown in Fig 9. The feature swapping results for the dSprites dataset were not consistent, probably because of the ambiguity induced by the color transformation in the feature attribute space when applied to single-channel images.

(a) Specified Attribute: Hair Color
(b) Specified Attribute: Pose
Figure 9: Attribute transfer results for DisCont obtained by swapping the specified chunk.

Appendix G Latent Visualisation

In this section, we present additional latent visualisations of the test set samples with and without the unspecified chunk $z_u$. The latent visualizations for the Cars3D, Sprites and 3DShapes datasets are shown in Fig 13.

(a) Cars3D
(b) 3DShapes
(c) Sprites
Figure 13: Latent space visualizations for the Cars3D, 3DShapes and Sprites datasets with all latent chunks (Left), i.e., $\{f_1, \dots, f_k, z_u\}$, and only the specified feature chunks (Right), i.e., $\{f_1, \dots, f_k\}$. Each color depicts samples from a different latent chunk, with the unspecified chunk $z_u$ shown in green for the Cars3D and Sprites datasets and in purple for the 3DShapes dataset.

Appendix H Future Work

In the future, we would like to explore research directions that involve generalizing the set of augmentations used. Further, following the claims in (Locatello et al., 2018), we would like to evaluate the performance gains obtained by leveraging our disentanglement model, in terms of sample complexity, on various downstream tasks.