
Self-Supervised Learning Disentangled Group Representation as Feature

A good visual representation is an inference map from observations (images) to features (vectors) that faithfully reflects the hidden modularized generative factors (semantics). In this paper, we formulate the notion of "good" representation from a group-theoretic view using Higgins' definition of disentangled representation, and show that existing Self-Supervised Learning (SSL) only disentangles simple augmentation features such as rotation and colorization, thus unable to modularize the remaining semantics. To break the limitation, we propose an iterative SSL algorithm: Iterative Partition-based Invariant Risk Minimization (IP-IRM), which successfully grounds the abstract semantics and the group acting on them into concrete contrastive learning. At each iteration, IP-IRM first partitions the training samples into two subsets that correspond to an entangled group element. Then, it minimizes a subset-invariant contrastive loss, where the invariance guarantees to disentangle the group element. We prove that IP-IRM converges to a fully disentangled representation and show its effectiveness on various benchmarks. Codes are available at





1 Introduction

Figure 1: Disentangled representation is an equivariant map between the semantic space $\mathcal{U}$ and the vector space $\mathcal{X}$, which is decomposed into “color” and “digit”.

Deep learning is all about learning feature representations [4]. Compared to conventional end-to-end supervised learning, Self-Supervised Learning (SSL) first learns a generic feature representation (e.g., a network backbone) by training with unsupervised pretext tasks such as the prevailing contrastive objective [38, 17]; this stage-1 feature is then expected to serve various stage-2 applications with proper fine-tuning. SSL for visual representation is fascinating because, for the first time, we can obtain “good” visual features for free, just like the trending pre-training in the NLP community [27, 9]. However, most SSL works only care about how much an SSL feature can improve stage-2 performance, and overlook what feature SSL is learning, why it can be learned, what cannot be learned, what the gap between SSL and Supervised Learning (SL) is, and when SSL can surpass SL.

The crux of answering those questions is to formally understand what a feature representation is and what a good one is. We postulate the classic world model of visual generation and feature representation [1, 71] as in Figure 1. Let $\mathcal{U}$ be a set of (unseen) semantics, e.g., attributes such as “digit” and “color”. There is a set of independent and causal mechanisms [68] $\varphi: \mathcal{U} \to \mathcal{I}$, generating images from semantics, e.g., writing a digit “0” when thinking of “0” [76]. A visual representation is the inference process $\phi: \mathcal{I} \to \mathcal{X}$ that maps image pixels to vector space features, e.g., a neural network. We define semantic representation as the functional composition $f: \mathcal{U} \to \mathcal{I} \to \mathcal{X}$. In this paper, we are only interested in the parameterization of the inference process for feature extraction, but not the generation process, i.e., we assume $\forall I \in \mathcal{I}$, $\exists u \in \mathcal{U}$, such that $I = \varphi(u)$ is fixed as the observation of each image sample. Therefore, we consider semantic and visual representations the same as feature representation, or simply representation, and we slightly abuse $\phi := f$, i.e., $\phi$ and $f$ share the same trainable parameters. We call the vector $\mathbf{x} = \phi(I)$ the feature, where $\mathbf{x} \in \mathcal{X}$.

We propose to use Higgins’ definition of disentangled representation [42] to define what is “good”.

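Definition 1. (Disentangled Representation) Let $\mathcal{G}$ be the group acting on the semantic space $\mathcal{U}$, i.e., $g \cdot u$ transforms $u \in \mathcal{U}$ for $g \in \mathcal{G}$, e.g., a “turn green” group element changing the semantic from “red” to “green”. Suppose there is a direct product decomposition $\mathcal{G} = g_1 \times \ldots \times g_m$ and $\mathcal{U} = \mathcal{U}_1 \times \ldots \times \mathcal{U}_m$, where $g_i$ acts on $\mathcal{U}_i$ respectively. (Footnote 1: Note that $g_i$ can also denote a cyclic subgroup such as rotation, or a countable one but treated as cyclic such as translation and color.) A feature representation $f: \mathcal{U} \to \mathcal{X}$ is disentangled if there exists a group $\mathcal{G}$ acting on the feature space $\mathcal{X}$ such that: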

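  1. Equivariant: $\forall g \in \mathcal{G}, \forall u \in \mathcal{U}$, $f(g \cdot u) = g \cdot f(u)$, e.g., the feature of the changed semantic, “red” to “green” in $\mathcal{U}$, is equivalent to directly changing the color vector in $\mathcal{X}$ from “red” to “green”.

  2. Decomposable: there is a decomposition $\mathcal{X} = \mathcal{X}_1 \times \ldots \times \mathcal{X}_m$, such that each $\mathcal{X}_i$ is fixed by the action of all $g_j, j \neq i$ and affected only by $g_i$, e.g., changing the “color” semantic in $\mathcal{U}$ does not affect the “digit” vector in $\mathcal{X}$.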
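To make the two properties of Definition 1 concrete, the following minimal sketch checks them on a toy example. Everything in it is an illustrative assumption rather than part of the method: semantics are (color, digit) pairs, the group consists of cyclic shifts of each attribute, and the hypothetical feature map `f` concatenates two one-hot blocks.

```python
from itertools import product

N_COLORS, N_DIGITS = 2, 10  # toy semantics: color in Z_2, digit in Z_10


def act_on_semantics(g, u):
    """Group element g = (dc, dd) shifts color and digit cyclically."""
    return ((u[0] + g[0]) % N_COLORS, (u[1] + g[1]) % N_DIGITS)


def one_hot(index, size):
    return [1 if i == index else 0 for i in range(size)]


def f(u):
    """Hypothetical disentangled feature: (color block, digit block)."""
    return (one_hot(u[0], N_COLORS), one_hot(u[1], N_DIGITS))


def roll(block, shift):
    """Cyclic shift of a list to the right by `shift` positions."""
    shift %= len(block)
    return block[-shift:] + block[:-shift] if shift else list(block)


def act_on_features(g, x):
    """The same group element acts on each feature block separately."""
    return (roll(x[0], g[0]), roll(x[1], g[1]))


def is_equivariant():
    """Check f(g . u) == g . f(u) for every semantic u and group element g."""
    return all(
        f(act_on_semantics(g, u)) == act_on_features(g, f(u))
        for u in product(range(N_COLORS), range(N_DIGITS))
        for g in product(range(N_COLORS), range(N_DIGITS))
    )


def is_decomposable():
    """Color actions must leave the digit block fixed, and vice versa."""
    for u in product(range(N_COLORS), range(N_DIGITS)):
        x = f(u)
        for dc in range(N_COLORS):
            if act_on_features((dc, 0), x)[1] != x[1]:
                return False
        for dd in range(N_DIGITS):
            if act_on_features((0, dd), x)[0] != x[0]:
                return False
    return True
```

In this toy setting both checks pass by construction; a feature map that mixed color and digit into a single block would have no such block-wise action.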

Figure 2: (a) The heat map visualizes feature dimensions related to augmentations (aug. related) and unrelated to augmentations (aug. unrelated), whose respective classification accuracy is shown in the bar chart below. The dashed bar denotes the accuracy using the full feature dimensions. The experiment was performed on STL10 [23] with representations learnt with SimCLR [17] and our IP-IRM. (b) Visualization of CNN activations [79] of 4 filters on layers 29 and 18 of VGG [77] trained on ImageNet100 [83]. The filters were chosen by first clustering the aug. unrelated filters with $k$-means and then selecting the filters corresponding to the cluster centers.

Compared to the previous definition of feature representation, which is a static mapping, the disentangled representation in Definition 1 is dynamic, as it explicitly incorporates group representation [37], which is a homomorphism from a group to group actions on a space, e.g., $\mathcal{G} \to \mathcal{X} \times \mathcal{X}$, and it is common to use the feature space $\mathcal{X}$ as a shorthand; this is where our title stands.

Definition 1 defines “good” features in the common views: 1) Robustness: a good feature should be invariant to the change of environmental semantics, such as external interventions [47, 89] or domain shifts [34]. By the above definition, a change is always retained in a subspace $\mathcal{X}_i$, while others are not affected. Hence, the subsequent classifier will focus on the invariant features and ignore the ever-changing $\mathcal{X}_i$. 2) Zero-shot Generalization: even if a new combination of semantics is unseen in training, each semantic has been learned as features. So, the metrics of each $\mathcal{X}_i$ trained by seen samples remain valid for unseen samples [97].

Are the existing SSL methods learning disentangled representations? No. We show in Section 4 that they can only disentangle representations according to the hand-crafted augmentations, e.g., color jitter and rotation. For example, in Figure 2 (a), even if we only use the augmentation-related feature, the classification accuracy of a standard SSL (SimCLR [17]) does not lose much as compared to the full feature use. Figure 2 (b) visualizes that the CNN features in each layer are indeed entangled (e.g., tyre, motor, and background in the motorcycle image). In contrast, our approach IP-IRM, to be introduced below, disentangles more useful features beyond augmentations.

In this paper, we propose Iterative Partition-based Invariant Risk Minimization (IP-IRM [aip:m]) that guarantees to learn disentangled representations in an SSL fashion. We present the algorithm in Section 3, followed by the theoretical justifications in Section 4. In a nutshell, at each iteration, IP-IRM first partitions the training data into two disjoint subsets, each of which is an orbit of the already disentangled group, and the cross-orbit group action corresponds to an entangled group element $g_i$. Then, we adopt Invariant Risk Minimization (IRM) [2] to implement a partition-based SSL, which disentangles the representation w.r.t. $g_i$. Iterating the above two steps eventually converges to a fully disentangled representation w.r.t. $\mathcal{G}$. In Section 5, we show promising experimental results on various feature disentanglement and SSL benchmarks.

2 Related Work

Self-Supervised Learning. SSL aims to learn representations from unlabeled data with hand-crafted pretext tasks [29, 65, 35]. Recently, Contrastive learning [67, 63, 40, 82, 17] prevails in most state-of-the-art methods. The key is to map positive samples closer, while pushing apart negative ones in the feature space. Specifically, the positive samples are from the augmented views [84, 3, 96, 44] of each instance and the negative ones are other instances. Along this direction, follow-up methods are mainly four-fold: 1) Memory-bank [92, 63, 38, 19]: storing the prototypes of all the instances computed previously into a memory bank to benefit from a large number of negative samples. 2) Using siamese network [8] to avoid representation collapse [36, 20, 85]. 3) Assigning clusters to samples to integrate inter-instance similarity into contrastive learning [12, 13, 14, 90, 58]. 4) Seeking hard negative samples with adversarial training or better sampling strategies [75, 21, 46, 50]. In contrast, our proposed IP-IRM jumps out of the above frame and introduces the disentangled representation into SSL with group theory to show the limitations of existing SSL and how to break through them.

Disentangled Representation. This notion dates back to [5], and henceforward becomes a high-level goal of separating the factors of variations in the data [86, 81, 88, 60]. Several works aim to provide a more precise description [28, 30, 74] by adopting an information-theoretic view [18, 28] and measuring the properties of a disentangled representation explicitly [30, 74]. We adopt the recent group-theoretic definition from Higgins et al. [42], which not only unifies the existing, but also resolves the previous controversial points [80, 61]. Although supervised learning of disentangled representation is a well-studied field [103, 45, 11, 72, 51], unsupervised disentanglement based on GAN [18, 66, 59, 73] or VAE [41, 16, 102, 52] is still believed to be theoretically challenging [61]. Thanks to the Higgins’ definition, we prove that the proposed IP-IRM converges with full-semantic disentanglement using group representation theory. Notably, IP-IRM learns a disentangled representation with an inference process, without using generative models as in all the existing unsupervised methods, making IP-IRM applicable even on large-scale datasets.

Group Representation Learning. A group representation has two elements [49, 37]: 1) a homomorphism (e.g., a mapping function) from the group to its group action acting on a vector space, and 2) the vector space. Usually, when there is no ambiguity, we can use either element as the definition. Most existing works focus on learning the first element. They first define the group of interest, such as spherical rotations [24] or image scaling [91, 78], and then learn the parameters of the group actions [25, 48, 70]. In contrast, we focus on the second element; more specifically, we are interested in learning a map between two vector spaces: image pixel space and feature vector space. Our representation learning is flexible because it delays the group action learning to downstream tasks on demand. For example, in a classification task, a classifier can be seen as a group action that is invariant to class-agnostic groups but equivariant to class-specific groups (see Section 4).

3 IP-IRM Algorithm

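Notations. Our goal is to learn the feature extractor $\phi$ in a self-supervised fashion. We define a partition matrix $\mathbf{P} \in \{0,1\}^{N \times 2}$ that partitions $N$ training images into 2 disjoint subsets. $\mathbf{P}_{i,k} = 1$ if the $i$-th image belongs to the $k$-th subset and $0$ otherwise. Suppose we have a pretext task loss function $\mathcal{L}(\phi, \theta = 1, k, \mathbf{P})$ defined on the samples in the $k$-th subset, where $\theta = 1$ is a “dummy” parameter used to evaluate the invariance of the SSL loss across the subsets (later discussed in Step 1). For example, $\mathcal{L}$ can be defined as:

$$\mathcal{L}(\phi, \theta = 1, k, \mathbf{P}) = \sum_{\mathbf{x} \in \mathbf{P}_k} -\log \frac{\exp(\mathbf{x}^T \mathbf{x}^* \cdot \theta)}{\sum_{\mathbf{x}' \in \mathbf{P}_k \cup \mathbf{P}_k^*} \exp(\mathbf{x}^T \mathbf{x}' \cdot \theta)}, \quad (1)$$

where $\mathbf{x} = \phi(I)$ is the feature of an image $I$ in the $k$-th subset $\mathbf{P}_k$, and $\mathbf{x}^*$ is the augmented view feature of $\mathbf{x}$.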
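As a minimal sketch of such a subset-wise pretext loss, the snippet below implements an InfoNCE-style contrastive loss restricted to one subset, with the dummy scale `theta` multiplying every similarity. The function name, the plain-list feature format, and the choice of candidate set (all features and augmented features of the subset) are assumptions made for illustration, not the reference implementation.

```python
import math


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def subset_contrastive_loss(features, aug_features, subset, theta=1.0):
    """InfoNCE-style loss summed over the images whose indices are in `subset`.

    features[i] and aug_features[i] are the two views of image i.
    theta scales every similarity; it stays at 1.0 and exists only so that
    the sensitivity of the loss to this scale can be measured.
    """
    # Assumed candidate set: every feature and augmented feature in the subset.
    candidates = [features[j] for j in subset] + [aug_features[j] for j in subset]
    total = 0.0
    for i in subset:
        anchor = features[i]
        positive = theta * dot(anchor, aug_features[i])
        logits = [theta * dot(anchor, c) for c in candidates]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_sum_exp = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_sum_exp - positive
    return total
```

With `subset` covering every image index, this reduces to an ordinary contrastive loss over the whole training set.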

Input. $N$ training images. Randomly initialized $\phi$. A partition matrix $\mathbf{P}$ initialized such that the first column of $\mathbf{P}$ is 1, i.e., all samples belong to the first subset. Set $\mathcal{P} = \{\mathbf{P}\}$.

Output. Disentangled feature extractor $\phi$.

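Step 1 [Update $\phi$]. We update $\phi$ by:

$$\min_{\phi} \sum_{\mathbf{P} \in \mathcal{P}} \sum_{k=1}^{2} \left[ \mathcal{L}(\phi, \theta = 1, k, \mathbf{P}) + \lambda_1 \left\| \nabla_{\theta = 1} \mathcal{L}(\phi, \theta = 1, k, \mathbf{P}) \right\|^2 \right], \quad (2)$$

where $\lambda_1$ is a hyper-parameter. The second term delineates how far the contrast in one subset is from a constant baseline $\theta = 1$. Minimizing both terms encourages $\phi$ in different subsets to stay close to the same baseline, i.e., invariance across the subsets. See IRM [2] for more details. In particular, the first iteration corresponds to the standard SSL, with the first subset in Eq. (1) containing all training images.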
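The invariance penalty in Step 1 can be sketched as follows, under the assumption that the pretext loss is a softmax contrastive loss whose similarities are all multiplied by the dummy scale `theta`. In that case the derivative with respect to `theta` has a closed form (the softmax-weighted mean similarity minus the positive similarity), so no autodiff library is needed. Function names and the list-based data layout are hypothetical.

```python
import math


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def loss_and_theta_grad(features, aug_features, subset, theta=1.0):
    """Return the subset contrastive loss and its derivative w.r.t. theta."""
    candidates = [features[j] for j in subset] + [aug_features[j] for j in subset]
    loss, grad = 0.0, 0.0
    for i in subset:
        sims = [dot(features[i], c) for c in candidates]
        pos = dot(features[i], aug_features[i])
        peak = max(theta * s for s in sims)
        weights = [math.exp(theta * s - peak) for s in sims]
        norm = sum(weights)
        loss += peak + math.log(norm) - theta * pos
        # d/dtheta of logsumexp(theta * s) is the softmax-weighted mean of s.
        grad += sum(w * s for w, s in zip(weights, sims)) / norm - pos
    return loss, grad


def irm_objective(features, aug_features, partitions, lam1):
    """Sum over partitions and subsets of loss + lam1 * (dloss/dtheta)^2 at theta = 1."""
    total = 0.0
    for subsets in partitions:
        for subset in subsets:
            if not subset:
                continue  # an empty subset contributes nothing
            loss, grad = loss_and_theta_grad(features, aug_features, subset)
            total += loss + lam1 * grad ** 2
    return total
```

Here each partition is given as a pair of index lists; the initial partition with all samples in the first subset is `[list(range(N)), []]`, for which the objective is the standard loss plus its own penalty term.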

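Step 2 [Update $\mathcal{P}$]. We fix $\phi$ and find a new partition $\mathbf{P}^*$ by

$$\mathbf{P}^* = \arg\max_{\mathbf{P}} \sum_{k=1}^{2} \left[ \mathcal{L}(\phi, \theta = 1, k, \mathbf{P}) + \lambda_2 \left\| \nabla_{\theta = 1} \mathcal{L}(\phi, \theta = 1, k, \mathbf{P}) \right\|^2 \right], \quad (3)$$

where $\lambda_2$ is a hyper-parameter. In practice, we use a continuous (real-valued) partition matrix during optimization and then threshold it to a binary one.

We update $\mathcal{P} \leftarrow \mathcal{P} \cup \mathbf{P}^*$ and iterate the above two steps until convergence.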
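The partition update can be illustrated with a small sketch. Two simplifications are assumed here and are not the procedure used in the paper: the relaxed matrix is thresholded by a row-wise argmax, and the maximization is done by exhaustive search over all two-subset assignments, which is only feasible for a handful of images. The `loss_gap` objective and the `toy_losses` values are hypothetical stand-ins for the partition objective.

```python
from itertools import product


def threshold_partition(soft_rows):
    """Turn real-valued rows (one per image, one column per subset) into one-hot rows."""
    hard = []
    for row in soft_rows:
        k = max(range(len(row)), key=lambda j: row[j])
        hard.append([1 if j == k else 0 for j in range(len(row))])
    return hard


def subsets_from_partition(hard_rows):
    """Return the list of image indices assigned to each subset."""
    n_subsets = len(hard_rows[0])
    return [[i for i, row in enumerate(hard_rows) if row[k] == 1]
            for k in range(n_subsets)]


def best_two_way_partition(n_images, objective):
    """Exhaustively search all two-subset assignments and keep the first maximizer."""
    best, best_value = None, float("-inf")
    for assignment in product((0, 1), repeat=n_images):
        rows = [[1 - a, a] for a in assignment]
        value = objective(subsets_from_partition(rows))
        if value > best_value:
            best, best_value = rows, value
    return best, best_value


toy_losses = [0.1, 0.2, 0.9, 1.0]  # hypothetical per-image loss values


def loss_gap(subsets):
    """Toy objective: gap between the mean losses of the two subsets."""
    if not all(subsets):
        return 0.0  # a partition with an empty subset shows no variance
    means = [sum(toy_losses[i] for i in s) / len(s) for s in subsets]
    return abs(means[0] - means[1])


best_rows, best_value = best_two_way_partition(len(toy_losses), loss_gap)
```

On the toy values, the search separates the two low-loss images from the two high-loss ones, which mirrors the idea that the selected partition is the one across which the loss varies most.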

4 Justification

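Recall that IP-IRM uses training sample partitions to learn the disentangled representations w.r.t. $\mathcal{G} = g_1 \times \ldots \times g_m$. As we have a $\mathcal{G}$-equivariant feature map between the sample space $\mathcal{I}$ and feature space $\mathcal{X}$ (the equivariance is later guaranteed by Lemma 1), we slightly abuse the notation by using $\mathcal{X}$ to denote both spaces. Also, we assume that $\mathcal{X}$ is a homogeneous space of $\mathcal{G}$, i.e., any sample $\mathbf{x}' \in \mathcal{X}$ can be transited from another sample $\mathbf{x}$ by a group action $g \cdot \mathbf{x}$. Intuitively, $\mathcal{G}$ is all you need to describe the diversity of the training set. It is worth noting that $g$ is any group element in $\mathcal{G}$ while $g_i$ is a Cartesian “building block” of $\mathcal{G}$, e.g., $g$ can be decomposed by $(g_1, g_2, \ldots, g_m)$.

We show that partition and group are tightly connected by the concept of orbit. Given a sample $\mathbf{x} \in \mathcal{X}$, its group orbit w.r.t. a group $\mathcal{D}$ is a sample set $\mathcal{D}(\mathbf{x}) = \{g \cdot \mathbf{x} \mid g \in \mathcal{D}\}$. As shown in Figure 3 (a), if $\mathcal{D}$ is a set of attributes shared by classes, e.g., “color” and “pose”, the orbit $\mathcal{D}(\mathbf{x})$ is the sample set of the class of $\mathbf{x}$; in Figure 3 (b), if $\mathcal{D}$ denotes augmentations, the orbit $\mathcal{D}(\mathbf{x})$ is the set of augmented images. In particular, we can see that the disjoint orbits in Figure 3 naturally form a partition. Formally, we have the following definition:

Definition 2. (Orbit & Partition [49]) Given a subgroup $\mathcal{D} \subset \mathcal{G}$, it partitions $\mathcal{X}$ into the disjoint subsets $\{\mathcal{D}(c_1 \cdot \mathbf{x}), \ldots, \mathcal{D}(c_k \cdot \mathbf{x})\}$, where $k$ is the number of cosets $\{c_1\mathcal{D}, \ldots, c_k\mathcal{D}\}$, and the cosets form a factor group $\mathcal{G}/\mathcal{D} = \{c_i\}_{i=1}^{k}$. (Footnote: Given $\mathcal{G} = \mathcal{D} \times \mathcal{K}$, $\mathcal{D}$ is a normal subgroup of $\mathcal{G}$, and $\mathcal{G}/\mathcal{D}$ is isomorphic to $\mathcal{K}$ [49]. We write $\mathcal{G}/\mathcal{D} = \mathcal{K}$ with slight abuse of notation.) In particular, $c_i \cdot \mathbf{x}$ can be considered as a sample of the $i$-th class, transited from any sample $\mathbf{x} \in \mathcal{X}$.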
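The connection between subgroup orbits and partitions can be checked on a small example. The sketch below assumes a toy sample space of (color, digit) pairs acted on by cyclic shifts, with the subgroup `D` changing color only; these choices are illustrative and not taken from the experiments.

```python
from itertools import product

COLORS, DIGITS = 2, 3  # toy sample space: (color, digit) pairs


def act(g, x):
    """Group element g = (dc, dd) shifts color and digit cyclically."""
    return ((x[0] + g[0]) % COLORS, (x[1] + g[1]) % DIGITS)


def orbit(subgroup, x):
    """All samples reachable from x by elements of the subgroup."""
    return frozenset(act(g, x) for g in subgroup)


def orbits(subgroup, samples):
    """The set of distinct orbits; equal orbits collapse into one entry."""
    return {orbit(subgroup, x) for x in samples}


samples = list(product(range(COLORS), range(DIGITS)))
G = list(product(range(COLORS), range(DIGITS)))  # full group
D = [(c, 0) for c in range(COLORS)]              # subgroup: change color only

parts = orbits(D, samples)
```

Each orbit of `D` collects the samples that share a digit, so the orbits play the role of classes, and their number equals the number of cosets `len(G) // len(D)`. Acting with the full group `G` yields a single orbit, which is the homogeneous-space assumption in this toy setting.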

Interestingly, the partition offers a new perspective for the training data format in Supervised Learning (SL) and Self-Supervised Learning (SSL). In SL, as shown in Figure 3 (a), the data is labeled with $k$ classes, each of which is an orbit $\mathcal{D}(c_i \cdot \mathbf{x})$ with training samples, whose variations are depicted by the class-sharing attribute group $\mathcal{D}$. The cross-orbit group action, e.g., $c_{\text{dog}}$, can be read as “turn $\mathbf{x}$ into a dog”, and such a “turn” is always valid due to the assumption that $\mathcal{X}$ is a homogeneous space of $\mathcal{G}$. In SSL, as shown in Figure 3 (b), each training sample $\mathbf{x}$ is augmented by the group $\mathcal{D}$. So, $\mathcal{D}(c_i \cdot \mathbf{x})$ consists of all the augmentations of the $i$-th sample, where the cross-orbit group action $c_i$ can be read as “turn $\mathbf{x}$ into the $i$-th sample”.

Thanks to the orbit and partition view of training data, we are ready to revisit model generalization in a group-theoretic view by using invariance and equivariance, the two sides of the coin whose name is disentanglement. For SL, we expect that a good feature is disentangled into a class-agnostic part and a class-specific part: the former (latter) is invariant (equivariant) to $\mathcal{G}/\mathcal{D}$ (cross-orbit traverse), but equivariant (invariant) to $\mathcal{D}$ (in-orbit traverse). By using such a feature, a model can generalize to diverse testing samples (limited to $|\mathcal{D}|$ variations) by only keeping the class-specific feature. Formally, we prove that we can achieve such disentanglement by contrastive learning:

Figure 3: Each orbit is illustrated with only 5 samples. (a) Orbit: the training samples of a class; in-orbit actions: intra-class variations (standing, blacken, jumping, whiten, running); cross-orbit actions: inter-class variations. (b) Orbit: a sample and its augmented samples; in-orbit actions: augmentations (clockwise rotation, color jitter, gray scale, counterclockwise rotation, color); cross-orbit actions: inter-sample variations. (c) Step 2 in IP-IRM discovers 2 orbits, where the cross-orbit action corresponds to a group action “green to red” or “red to green”, which is not yet disentangled.

Lemma 1. (Disentanglement by Contrastive Learning) Training loss $-\log \frac{\exp(\mathbf{x}_i^T \mathbf{x}_j)}{\sum_{\mathbf{x} \in \mathcal{X}} \exp(\mathbf{x}_i^T \mathbf{x})}$ disentangles $\mathcal{X}$ w.r.t. $(\mathcal{G}/\mathcal{D}) \times \mathcal{D}$, where $\mathbf{x}_i$ and $\mathbf{x}_j$ are from the same orbit.

We can draw the following interesting corollaries from Lemma 1 (details in Appendix):

  1. If we use all the samples in the denominator of the loss, we can approximate $\mathcal{G}$-equivariant features given limited training samples. This is because the loss minimization guarantees $\mathbf{x}_i \neq \mathbf{x}_j$ for all $i \neq j$, i.e., any pair corresponds to a group action.

  2. Conventional cross-entropy loss in SL is a special case, if we define $\mathbf{x} \in \{\mathbf{x}_1, \ldots, \mathbf{x}_k\}$ as $k$ classifier weights. So, SL does not guarantee the disentanglement of $\mathcal{G}/\mathcal{D}$, which causes generalization error if the class domain of the downstream task is different from SL pre-training, e.g., a subset of $\mathcal{G}/\mathcal{D}$.

  3. In contrastive learning based SSL, $\mathcal{D}$ corresponds to the augmentations (recall Figure 2), and the number of augmentations $|\mathcal{D}|$ is generally much smaller compared to the class-wise sample diversity in SL. This enables the SL model to generalize to more diverse testing samples (larger $|\mathcal{D}|$) by filtering out the class-agnostic features (e.g., background) and focusing on the class-specific ones (e.g., foreground), which explains why SSL is worse than SL in downstream classification.

  4. In SL, if the number of training samples per orbit is not enough, i.e., smaller than $|\mathcal{D}|$, the disentanglement between $\mathcal{D}$ and $\mathcal{G}/\mathcal{D}$ cannot be guaranteed, such as the challenges in few-shot learning [98]. Fortunately, in SSL, the number is enough as we always include all the augmented samples in training. Moreover, we conjecture that $\mathcal{D}$ only contains simple cyclic group elements such as rotation and colorization, which are easier for representation learning.

Lemma 1 does not guarantee the decomposability of each $d \in \mathcal{D}$. Nonetheless, the downstream model can still generalize by keeping the class-specific features affected by $\mathcal{G}/\mathcal{D}$. Therefore, the key to filling the gap, or even letting SSL surpass SL, is to achieve the full disentanglement of $\mathcal{G}/\mathcal{D}$.

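Theorem 1. The representation is fully disentangled w.r.t. $\mathcal{G}/\mathcal{D}$ if and only if $\forall c_i \in \mathcal{G}/\mathcal{D}$, the contrastive loss in Eq. (1) is invariant to the 2 orbits of partition $\{\mathcal{G}'(c_i \cdot \mathbf{x}), \mathcal{G}'(c_i^{-1} \cdot \mathbf{x})\}$, where $\mathcal{G}' = \mathcal{G}/c_i$.

The maximization in Step 2 is based on the contra-position of the sufficient condition of Theorem 1. Denote the currently disentangled group as $\mathcal{D}$ (initially the augmentations). If we can find a partition $\mathbf{P}^*$ to maximize the loss in Eq. (3), i.e., the SSL loss is variant across the orbits, then $\exists h \in \mathcal{G}/\mathcal{D}$ such that the representation of $h$ is entangled, i.e., $\mathbf{P}^* = \{\mathcal{D}(h \cdot \mathbf{x}), \mathcal{D}(h^{-1} \cdot \mathbf{x})\}$. Figure 3 (c) illustrates a discovered partition about color. The minimization in Step 1 is based on the necessary condition of Theorem 1. Based on the discovered $\mathbf{P}^*$, if we minimize Eq. (2), we can further disentangle $h$ and update $\mathcal{D} \leftarrow \mathcal{D} \times h$. Overall, IP-IRM converges as $\mathcal{G}/\mathcal{D}$ is finite. Note that an improved contrastive objective [94] can further disentangle each $d \in \mathcal{D}$ and achieve full disentanglement w.r.t. $\mathcal{G}$.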

5 Experiments

5.1 Unsupervised Disentanglement

Datasets. We used two datasets. CMNIST [2] has 60,000 digit images with semantic labels of digits (0-9) and colors (red and green). These images differ in other semantics (e.g., slant and font) that are not labeled. Moreover, there is a strong correlation between digits and colors (most 0-4 in red and 5-9 in green), increasing the difficulty to disentangle them. Shapes3D [52] contains 480,000 images with 6 labelled semantics, i.e., size, type, azimuth, as well as floor, wall and object color. Note that we only considered the first three semantics for evaluation, as the standard augmentations in SSL will contaminate any color-related semantics.

Settings. We adopted 6 representative disentanglement metrics: Disentangle Metric for Informativeness (DCI) [30], Interventional Robustness Score (IRS) [81], Explicitness Score (EXP) [74], Modularity Score (MOD) [74] and the accuracy of predicting the ground-truth semantic labels by two classification models, logistic regression (LR) and gradient boosted trees (GBT) [61]. Specifically, DCI and EXP measure the explicitness, i.e., the values of semantics can be decoded from the feature using a linear transformation. MOD and IRS measure the modularity, i.e., whether each feature dimension is equivariant to the shift of a single semantic. See Appendix for more detailed formulas of the metrics. In evaluation, we trained CNN-based feature extractor backbones with a comparable number of parameters for all the baselines and our IP-IRM. The full implementation details are in Appendix.

Dataset Method DCI IRS MOD EXP LR GBT Average
CMNIST VAE [53] 0.948±0.004 - 0.664 … 0.948±0.004 0.849 …
CMNIST β-VAE [43] 0.945 - 0.705 …
CMNIST β-AnnealVAE [10] 0.911 - 0.790 …
CMNIST β-TCVAE [16] 0.914 - 0.864 …
CMNIST Factor-VAE [52] 0.916 - 0.893±0.056 0.947 …
CMNIST SimCLR [17] 0.882 - 0.767 …
CMNIST IP-IRM (Ours) 0.917±0.008 - 0.785±0.031 0.990±0.002 0.921±0.009 0.916±0.007 0.906±0.011
Shapes3D VAE [53] 0.351 … 0.820±0.015 0.802 …
Shapes3D β-VAE [43] 0.369 …
Shapes3D β-AnnealVAE [10] 0.327 …
Shapes3D β-TCVAE [16] 0.470 …
Shapes3D Factor-VAE [52] 0.340 …
Shapes3D SimCLR [17] 0.535 0.439±0.030 0.678 …
Shapes3D IP-IRM (Ours) 0.565±0.023 0.420±0.014 0.766±0.036 0.959±0.007 0.757±0.025 0.565±0.023 0.672±0.017
Table 1: Results on disentanglement metrics of existing unsupervised disentanglement methods, standard SSL (SimCLR [17]) and IP-IRM using CMNIST [2] and Shapes3D [52]. Note that IRS is based on intervening on the semantics, which requires access to the labels of all the semantics, and is hence not applicable to the CMNIST dataset. Results are averaged over 4 trials (mean ± std).

Results. In Table 1, we compare the proposed IP-IRM to the standard SSL method SimCLR [17] as well as several generative disentanglement methods [53, 43, 10, 16, 52]. On both CMNIST and Shapes3D, IP-IRM outperforms SimCLR on all metrics except IRS, with the largest gain of 8.8% on MOD. On MOD, we notice that VAE performs better than our IP-IRM by 6 points, i.e., 0.82 vs. 0.76 on Shapes3D. This is because VAE explicitly pursues a high modularity score by regularizing the dimension-wise independence in the feature space. However, this regularization is adversarial to discriminative objectives [15, 97]. Indeed, we can observe from the LR column (i.e., the performance of downstream linear classification) that VAE methods perform clearly worse, especially on the more challenging Shapes3D dataset. We can draw the same conclusion from the results of GBT. Different from VAE methods, our IP-IRM is optimized towards disentanglement without such regularization, and is thus able to outperform the others in downstream tasks while obtaining a competitive value of modularity.

What do IP-IRM features look like? Figure 4 visualizes the features learned by SimCLR and our IP-IRM on two datasets: CMNIST in Figure 4 (a) and STL10 dataset in Figure 4 (b). In the following, we use Figure 4 (a) as the example, and can easily draw the similar conclusions from Figure 4 (b). On the left-hand side of Figure 4 (a), it is obvious that there is no clear boundary to distinguish the semantic of color in the SimCLR feature space. Besides, the features of the same digit semantic are scattered in two regions. On the right-hand side of (a), we have 3 observations for IP-IRM. 1) The features are well clustered and each cluster corresponds to a specific semantic of either digit or color. This validates the equivariant property of IP-IRM representation that it responds to any changes of the existing semantics, e.g., digit and color on this dataset. 2) The feature space has the symmetrical structure for each individual semantic, validating the decomposable property of IP-IRM representation. More specifically, i) mirroring a feature (w.r.t. “*” in the figure center) indicates the change on the only semantic of color, regardless of the other semantic (digit); and ii) a counterclockwise rotation (denoted by black arrows from same-colored 1 to 7) indicates the change on the only semantic of digit. 3) IP-IRM reveals the true distribution (similarity) of different classes. For example, digits 3, 5, 8 sharing sub-parts (curved bottoms and turnings) have closer feature points in the IP-IRM feature space.

Figure 4: The t-SNE [87] visualizations of learned feature spaces using SimCLR [17] and IP-IRM on CMNIST [2] and STL10 [23]. For CMNIST in (a), we annotate the digit and color near each cluster. We annotate only half of the feature points for SimCLR to avoid clutter. For STL10 in (b), we show the labels of the classes.
Figure 5: (a) Visualization of the obtained partitions during training. Each partition has two subsets and the displayed images are randomly sampled from each subset. (b) Visualization of the variance of each feature dimension when perturbing the semantic indicated on the left. The most equivariant dimensions are indicated by triangles and their corresponding indices.

How does IP-IRM disentangle features? 1) Discovered $\mathbf{P}^*$: To visualize the discovered partitions at each maximization step, we performed an experiment on a binary CMNIST (digits 0 and 1 in red and green), and show the results in Figure 5 (a). Please refer to Appendix for the full results on CMNIST. First, each partition separates a specific semantic into two subsets, e.g., in Partition #1, red and green digits are separated. Second, besides the obvious semantics, digit and color (labelled in the dataset), we can discover new semantics, e.g., the digit slant shown in Partition #3. 2) Disentangled Representation: In Figure 5 (b), we aim to visualize how equivariant each feature dimension is to the change of each semantic, i.e., a darker color shows that a dimension is more equivariant w.r.t. the semantic indicated on the left. We can see that SimCLR fails to learn the decomposable representation, e.g., the 8-th dimension captures azimuth, type and size in Shapes3D. In contrast, our IP-IRM achieves disentanglement by representing the semantics in interpretable dimensions, e.g., the 6-th and 7-th dimensions capture the size, the 4-th the type, and the 2-nd and 9-th the azimuth on Shapes3D. Overall, the results support the justification in Section 4, i.e., we discover a new semantic (affected by $g_i$) through the partition at each iteration, and IP-IRM eventually converges to a disentangled representation.

5.2 Self-Supervised Learning

Datasets and Settings. We conducted the SSL evaluations on 2 standard benchmarks following [90, 21, 50]. Cifar100 [55] contains 60,000 images in 100 classes and STL10 [23] has 113,000 images in 10 classes. We used SimCLR [17], DCL [21] and HCL [50] as baselines, and learned the representations for 400 and 1,000 epochs. We evaluated both linear and $k$-NN accuracies for the downstream classification task. Implementation details are in Appendix.

Method STL10 Cifar100
k-NN Linear k-NN Linear
400-epoch training
SimCLR [17] 73.60 78.89 54.94 66.63
DCL [21] 78.82 82.56 57.29 68.59
HCL [50] 80.06 87.60 59.61 69.22
SimCLR+IP-IRM 79.66 84.44 59.10 69.55
DCL+IP-IRM 81.51 85.36 58.37 68.76
HCL+IP-IRM 84.29 87.81 60.05 69.95
1,000-epoch training
SimCLR [17] 78.60 84.24 59.45 68.73
SimCLR† [57] 79.80 85.56 63.67 72.18
SimCLR+IP-IRM 85.08 89.91 65.82 73.99
Supervised - - - 73.72
Supervised+MixUp [100] - - - 74.19
Table 2: Accuracy (%) of $k$-NN and linear classifiers on STL10 [23] and Cifar100 [55] using the representations of SimCLR [17], DCL [21], HCL [50] and those after incorporating our IP-IRM. SimCLR† denotes SimCLR with MixUp regularization. Supervised denotes supervised learning that keeps the same codebase, optimizer and parameters as SSL stage-2 fine-tuning, and only adds learning rate decay at epochs 60 and 80.

Results. We present our results and compare with baselines in Table 2. Incorporating IP-IRM into the 3 baselines brings consistent performance boosts to downstream classification models in all settings, e.g., improving the linear models by 5.55% on STL10 and 2.92% on Cifar100 over SimCLR. In particular, we observe that IP-IRM brings a large performance gain with $k$-NN classifiers, e.g., 4.23% using HCL+IP-IRM on STL10, i.e., the distance metrics in the IP-IRM feature space more faithfully reflect the class semantic differences. This validates that our algorithm further disentangles $\mathcal{G}/\mathcal{D}$ compared to the standard SSL. Moreover, by extending the training process to 1,000 epochs with MixUp [57], SimCLR+IP-IRM achieves a further performance boost on both datasets, e.g., 5.28% for $k$-NN and 4.35% for the linear classifier over the SimCLR baseline with MixUp on STL10. Notably, our SimCLR+IP-IRM surpasses vanilla supervised learning on Cifar100 under the same evaluation setting. Still, the quality of disentanglement cannot be fully evaluated when the training and test samples are identically distributed: while the improved accuracy demonstrates that the IP-IRM representation is more equivariant to class semantics, it does not reveal whether the representation is decomposable. Hence we present an out-of-distribution (OOD) setting in Section 5.3 to further show this property.
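The gains quoted above are absolute differences between rows of Table 2, which the following short check recomputes. The dictionary keys and the `gain` helper are names introduced here for illustration; the numbers are copied from Table 2.

```python
# Accuracy (%) rows copied from Table 2:
# (STL10 k-NN, STL10 linear, Cifar100 k-NN, Cifar100 linear)
TABLE2 = {
    "SimCLR@400": (73.60, 78.89, 54.94, 66.63),
    "HCL@400": (80.06, 87.60, 59.61, 69.22),
    "SimCLR+IP-IRM@400": (79.66, 84.44, 59.10, 69.55),
    "HCL+IP-IRM@400": (84.29, 87.81, 60.05, 69.95),
    "SimCLR+MixUp@1000": (79.80, 85.56, 63.67, 72.18),
    "SimCLR+IP-IRM@1000": (85.08, 89.91, 65.82, 73.99),
}
COLUMNS = ("stl10_knn", "stl10_linear", "cifar100_knn", "cifar100_linear")


def gain(baseline, improved, column):
    """Absolute accuracy gain in percentage points, rounded to 2 decimals."""
    j = COLUMNS.index(column)
    return round(TABLE2[improved][j] - TABLE2[baseline][j], 2)


linear_gain_stl10 = gain("SimCLR@400", "SimCLR+IP-IRM@400", "stl10_linear")
linear_gain_cifar = gain("SimCLR@400", "SimCLR+IP-IRM@400", "cifar100_linear")
knn_gain_hcl = gain("HCL@400", "HCL+IP-IRM@400", "stl10_knn")
knn_gain_long = gain("SimCLR+MixUp@1000", "SimCLR+IP-IRM@1000", "stl10_knn")
linear_gain_long = gain("SimCLR+MixUp@1000", "SimCLR+IP-IRM@1000", "stl10_linear")
```

The 1,000-epoch gains are measured against the SimCLR row trained with MixUp, not the plain SimCLR row.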

Figure 6: Our ablation study on the STL10 and Cifar100 datasets. (a) The Top-1 accuracy (%) of linear classifiers using different values of $\lambda_1$ and $\lambda_2$ (in Eq. (2) and Eq. (3)), by training for 200 epochs on two datasets. (b) The Top-1 accuracy (%) of $k$-NN classifiers on two datasets, for which we trained the models for 700 epochs and updated the partition every 50 epochs.

Is IP-IRM sensitive to the values of hyper-parameters? 1) $\lambda_1$ and $\lambda_2$ in Eq. (2) and Eq. (3). In Figure 6 (a), we observe that the best performance is achieved with $\lambda_1$ and $\lambda_2$ within a limited range of values on both datasets, and that all accuracies drop sharply once $\lambda_1$ becomes too large. The reason is that a higher $\lambda_1$ forces the model to push the $\phi$-induced similarity to the fixed baseline $\theta = 1$, rather than decrease the loss on the pretext task, leading to poor convergence. 2) The number of epochs. In Figure 6 (b), we plot the Top-1 accuracies of $k$-NN classifiers along the 700-epoch training of two kinds of SSL representations, SimCLR and IP-IRM. IP-IRM converges faster and achieves a higher accuracy than SimCLR. It is worth highlighting that on STL10, the accuracy of SimCLR starts to oscillate and grow slowly after the 150-th epoch, while ours keeps improving. This is empirical evidence that IP-IRM keeps disentangling more and more semantics in the feature space, and has the potential to improve through long-term training.

5.3 Potential on Large-Scale Data

Datasets. We evaluated on the standard supervised learning benchmark ImageNet ILSVRC-2012 [26], which has in total 1,331,167 images in 1,000 classes. To further reveal whether a representation is decomposable, we used NICO [39], which is a real-world image dataset designed for OOD evaluations. It contains 25,000 images in 19 classes, with a strong correlation between the foreground and background in the train split (e.g., most dogs on grass). We also studied the transferability of the learned representation following [31, 54]: FGVC Aircraft (Aircraft) [62], Caltech-101 (Caltech) [33], Stanford Cars (Cars) [95], Cifar10 [56], Cifar100 [56], DTD [22], Oxford 102 Flowers (Flowers) [64], Food-101 (Food) [6], Oxford-IIIT Pets (Pets) [69] and SUN397 (SUN) [93]. These datasets include coarse- to fine-grained classification tasks, and vary in the amount of training data (2,000-75,000 images) and classes (10-397 classes), representing a wide range of transfer learning settings.


For ImageNet, all the representations were trained for 200 epochs due to limited computing resources. We followed the common setting [82, 38], using a linear classifier, and report Top-1 classification accuracies. For NICO, we fixed the ImageNet pre-trained ResNet-50 backbone and fine-tuned the classifier. See the appendix for more training details. For transfer learning, we followed [31, 54] and report the classification accuracies on Cars, Cifar-10, Cifar-100, DTD, Food and SUN, and the average per-class accuracies on Aircraft, Caltech, Flowers and Pets. We refer to both uniformly as Accuracy. We used the few-shot N-way-K-shot setting for model evaluation. Specifically, we randomly sampled 2,000 episodes from the test splits of the above datasets. An episode contains N classes, each with K training samples and 15 testing samples. We fine-tuned the linear classifier (backbone weights frozen) for 100 epochs on the training samples, and evaluated the classifier on the testing samples. We evaluated with N=5 and K=5 (results of further settings in Appendix).
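The episode construction described above can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the function name `sample_episode`, its parameters, and the choice to skip classes with too few test images are assumptions. Fine-tuning the linear classifier on the frozen features is omitted.

```python
import numpy as np

def sample_episode(labels, n_way, k_shot, n_query=15, rng=None):
    """Sample one few-shot episode from a pool of test labels.

    Returns (support indices, query indices, chosen classes).
    Assumption: classes with fewer than k_shot + n_query samples are skipped.
    """
    rng = np.random.default_rng() if rng is None else rng
    labels = np.asarray(labels)
    classes, counts = np.unique(labels, return_counts=True)
    eligible = classes[counts >= k_shot + n_query]
    chosen = rng.choice(eligible, size=n_way, replace=False)
    support, query = [], []
    for c in chosen:
        idx = rng.permutation(np.flatnonzero(labels == c))
        support.extend(idx[:k_shot])                   # K training samples
        query.extend(idx[k_shot:k_shot + n_query])     # 15 testing samples
    return np.array(support), np.array(query), chosen

# Toy pool: 8 classes with 30 test images each (illustrative data only).
toy_labels = np.repeat(np.arange(8), 30)
support_idx, query_idx, episode_classes = sample_episode(
    toy_labels, n_way=5, k_shot=5, rng=np.random.default_rng(0)
)
```

Repeating this sampling 2,000 times and averaging the per-episode accuracy of the fine-tuned classifier would give the kind of number reported in Table 4.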

Method ImageNet NICO
InsDis [92] 56.5 65.6
PCL [58] 61.5 72.6
PIRL [63] 63.6 69.1
MoCo-v1 [38] 60.6 69.3
SimCLR (repro.) [17] 63.1 64.5
MoCo-v2 (repro.) [19] 67.3 78.0
SimSiam (repro.) [20] 68.8 66.7
SimCLR+IP-IRM 64.8 66.7
MoCo-v2+IP-IRM 67.6 79.5
SimSiam+IP-IRM 69.1 70.9
Table 3: ImageNet and NICO Top-1 Accuracy (%) of linear classifiers trained on the representations learnt with different SSL methods.
Figure 7: Visualization of CAM [101] on images from NICO [39] dataset using representations of the baseline MoCo-v2 [19] and our IP-IRM.
Method Aircraft Caltech Cars Cifar10 Cifar100 DTD Flowers Food Pets SUN Average
InsDis [92] 35.07 75.97 37.49 51.49 57.61 69.38 77.35 50.01 66.38 74.97 59.57
PCL [58] 36.86 90.72 39.68 59.26 60.78 69.53 67.50 57.06 88.31 84.51 65.42
PIRL [63] 36.70 78.63 39.21 49.85 55.23 70.43 78.37 51.61 69.40 76.64 60.61
MoCo-v1 [38] 35.31 79.60 36.35 46.96 51.62 68.76 75.42 49.77 68.32 74.77 58.69
MoCo-v2 [19] 31.98 92.32 41.47 56.50 63.33 78.00 80.05 57.25 83.23 88.10 67.22
IP-IRM (Ours) 32.98 93.16 42.87 60.73 68.54 79.30 82.68 59.61 85.23 89.38 69.44
Table 4: Accuracy (%) of 5-way-5-shot few-shot evaluation using the image representation learned on ImageNet [26]. More detailed results are given in Appendix.

ImageNet and NICO. On ImageNet (Table 3), IP-IRM achieves the best accuracy among all baseline models. Yet we believe that this does not show the full potential of IP-IRM, because ImageNet is a large-scale dataset with many semantics, and it is hard to fully disentangle all of them within the limited 200 epochs. To evaluate the feature decomposability of IP-IRM, we compared the performance on NICO with various SSL baselines in Table 3, where adding IP-IRM improves each corresponding baseline (SimCLR, MoCo-v2 and SimSiam) by 1.5-4.2%. This validates that the IP-IRM feature is more decomposable: if each semantic feature (e.g., background) is decomposed into some fixed dimensions and some classes vary with that semantic, then the classifier will recognize it as a non-discriminative variant feature and hence focus on other, more discriminative features (i.e., foreground). In this way, even though some classes are confounded by those non-discriminative features (e.g., most of the “dog” images have a “grass” background), the fixed dimensions still help classifiers neglect the non-discriminative ones. We further visualized the CAM [101] on NICO in Figure 7, which indeed shows that IP-IRM helps the classifier focus on the foreground regions.

Few-Shot Tasks. As shown in Table 4, IP-IRM significantly improves the performance in the 5-way-5-shot setting, e.g., we outperform the baseline MoCo-v2 by 2.2% on average. This is because IP-IRM disentangles the representation further than standard SSL, which is essential for representations to generalize to different downstream class domains (recall Corollary 2 of Lemma 1). This is also in line with recent work [88] showing that a disentangled representation is especially beneficial in low-shot scenarios, and further demonstrates the importance of disentanglement in downstream tasks.
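The average gap over MoCo-v2 can be checked directly from the per-dataset entries of Table 4. The snippet below is only an arithmetic check on the numbers as printed in the table. The recomputed means can differ from the reported Average column in the last digit, presumably because the per-dataset entries are themselves rounded.

```python
# Per-dataset accuracies copied from Table 4, in column order:
# Aircraft, Caltech, Cars, Cifar10, Cifar100, DTD, Flowers, Food, Pets, SUN
TABLE4 = {
    "MoCo-v2": [31.98, 92.32, 41.47, 56.50, 63.33, 78.00, 80.05, 57.25, 83.23, 88.10],
    "IP-IRM":  [32.98, 93.16, 42.87, 60.73, 68.54, 79.30, 82.68, 59.61, 85.23, 89.38],
}
REPORTED_AVERAGE = {"MoCo-v2": 67.22, "IP-IRM": 69.44}

def mean(values):
    return sum(values) / len(values)

recomputed = {name: mean(row) for name, row in TABLE4.items()}
gap = recomputed["IP-IRM"] - recomputed["MoCo-v2"]  # roughly 2.2 points

for name, value in recomputed.items():
    print(f"{name}: recomputed {value:.3f}, reported {REPORTED_AVERAGE[name]:.2f}")
print(f"average gap over MoCo-v2: {gap:.2f}")
```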

6 Conclusion

We presented an unsupervised disentangled representation learning method called Iterative Partition-based Invariant Risk Minimization (IP-IRM), based on Self-Supervised Learning (SSL). IP-IRM iteratively partitions the dataset into semantic-related subsets, and learns a representation invariant across the subsets using SSL with an IRM loss. We show, with a theoretical guarantee, that IP-IRM converges to a disentangled representation under the group-theoretic view, which fundamentally surpasses the capabilities of existing SSL and fully-supervised learning. Our theory is backed by strong empirical results in disentanglement metrics, SSL classification accuracy and transfer performance. IP-IRM achieves disentanglement without using generative models, making it widely applicable to large-scale visual tasks. As future directions, we will continue to explore the application of group theory in representation learning and seek additional forms of inductive bias for faster convergence.

Acknowledgments and Disclosure of Funding

The authors would like to thank all reviewers for their constructive suggestions. This research is partly supported by the Alibaba-NTU Joint Research Institute, the A*STAR under its AME YIRG Grant (Project No. A20E6c0101), and the Singapore Ministry of Education (MOE) Academic Research Fund (AcRF) Tier 2 grant.

References


  • [1] P. W. Anderson (1972) More is different. Science. Cited by: §1.
  • [2] M. Arjovsky, L. Bottou, I. Gulrajani, and D. Lopez-Paz (2019) Invariant risk minimization. arXiv preprint arXiv:1907.02893. Cited by: §C.3.1, §1, §3, Figure 4, §5.1, Table 1.
  • [3] P. Bachman, R. D. Hjelm, and W. Buchwalter (2019) Learning representations by maximizing mutual information across views. arXiv preprint arXiv:1906.00910. Cited by: §2.
  • [4] Y. Bengio, A. Courville, and P. Vincent (2013) Representation learning: a review and new perspectives. IEEE transactions on pattern analysis and machine intelligence 35 (8), pp. 1798–1828. Cited by: §1.
  • [5] Y. Bengio (2009) Learning deep architectures for ai. Now Publishers Inc. Cited by: §2.
  • [6] L. Bossard, M. Guillaumin, and L. Van Gool (2014) Food-101–mining discriminative components with random forests. In European conference on computer vision, Cited by: §5.3.
  • [7] L. Breiman (2001) Random forests. Machine learning 45 (1), pp. 5–32. Cited by: §C.2.1.
  • [8] J. Bromley, I. Guyon, Y. LeCun, E. Säckinger, and R. Shah (1993) Signature verification using a "siamese" time delay neural network. Advances in neural information processing systems 6, pp. 737–744. Cited by: §2.
  • [9] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei (2020) Language models are few-shot learners. In Advances in Neural Information Processing Systems, Cited by: §1.
  • [10] C. P. Burgess, I. Higgins, A. Pal, L. Matthey, N. Watters, G. Desjardins, and A. Lerchner (2018) Understanding disentangling in β-VAE. arXiv preprint arXiv:1804.03599. Cited by: Table 12, §5.1, Table 1.
  • [11] R. Cai, Z. Li, P. Wei, J. Qiao, K. Zhang, and Z. Hao (2019) Learning disentangled semantic representation for domain adaptation. In IJCAI: proceedings of the conference, Cited by: §2.
  • [12] M. Caron, P. Bojanowski, A. Joulin, and M. Douze (2018) Deep clustering for unsupervised learning of visual features. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 132–149. Cited by: §2.
  • [13] M. Caron, P. Bojanowski, J. Mairal, and A. Joulin (2019) Unsupervised pre-training of image features on non-curated data. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2959–2968. Cited by: §2.
  • [14] M. Caron, I. Misra, J. Mairal, P. Goyal, P. Bojanowski, and A. Joulin (2020) Unsupervised learning of visual features by contrasting cluster assignments. arXiv preprint arXiv:2006.09882. Cited by: §2.
  • [15] L. Chen, H. Zhang, J. Xiao, W. Liu, and S. Chang (2018) Zero-shot visual recognition using semantics-preserving adversarial embedding networks. In CVPR, Cited by: §5.1.
  • [16] R. T. Chen, X. Li, R. Grosse, and D. Duvenaud (2018) Isolating sources of disentanglement in variational autoencoders. In Advances in neural information processing systems, Cited by: Table 12, §2, §5.1, Table 1.
  • [17] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton (2020) A simple framework for contrastive learning of visual representations. In International conference on machine learning, pp. 1597–1607. Cited by: Table 11, Table 12, Figure 2, §1, §1, §2, Figure 4, Figure 7, §5.1, §5.2, Table 1, Table 2.
  • [18] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel (2016) Infogan: interpretable representation learning by information maximizing generative adversarial nets. In Advances in neural information processing systems, Cited by: §2.
  • [19] X. Chen, H. Fan, R. Girshick, and K. He (2020) Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297. Cited by: Table 13, Table 14, Table 15, §2, Figure 7, Figure 7, Table 4.
  • [20] X. Chen and K. He (2020) Exploring simple siamese representation learning. arXiv preprint arXiv:2011.10566. Cited by: §C.3.1, §2, Figure 7.
  • [21] C. Chuang, J. Robinson, L. Yen-Chen, A. Torralba, and S. Jegelka (2020) Debiased contrastive learning. arXiv preprint arXiv:2007.00224. Cited by: §C.3.1, Table 11, §2, §5.2, Table 2.
  • [22] M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, and A. Vedaldi (2014) Describing textures in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §5.3.
  • [23] A. Coates, A. Ng, and H. Lee (2011) An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pp. 215–223. Cited by: Figure 2, Figure 4, §5.2, Table 2.
  • [24] T. S. Cohen, M. Geiger, J. Köhler, and M. Welling (2018) Spherical cnns. In ICLR, Cited by: §2.
  • [25] T. Cohen and M. Welling (2014) Learning the irreducible representations of commutative lie groups. In International Conference on Machine Learning, Cited by: §2.
  • [26] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. Cited by: §C.5, Table 13, Table 14, Table 15, §5.3, Table 4.
  • [27] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Cited by: §1.
  • [28] K. Do and T. Tran (2020) Theory and evaluation metrics for learning disentangled representations. In International conference on learning representations, Cited by: §2.
  • [29] C. Doersch, A. Gupta, and A. A. Efros (2015) Unsupervised visual representation learning by context prediction. In Proceedings of the IEEE international conference on computer vision, pp. 1422–1430. Cited by: §2.
  • [30] C. Eastwood and C. K. Williams (2018) A framework for the quantitative evaluation of disentangled representations. In International conference on learning representations, Cited by: §C.2.1, §2, §5.1.
  • [31] L. Ericsson, H. Gouk, and T. M. Hospedales (2021) How Well Do Self-Supervised Models Transfer?. In CVPR, Cited by: §C.5.1, §C.5, Table 13, §5.3, §5.3.
  • [32] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman (2010) The pascal visual object classes (voc) challenge. International journal of computer vision. Cited by: Table 13.
  • [33] L. Fei-Fei, R. Fergus, and P. Perona (2004) Learning generative visual models from few training examples: an incremental bayesian approach tested on 101 object categories. In 2004 conference on computer vision and pattern recognition workshop, Cited by: §5.3.
  • [34] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky (2016) Domain-adversarial training of neural networks. The Journal of Machine Learning Research. Cited by: §1.
  • [35] S. Gidaris, P. Singh, and N. Komodakis (2018) Unsupervised representation learning by predicting image rotations. arXiv preprint arXiv:1803.07728. Cited by: §2.
  • [36] J. Grill, F. Strub, F. Altché, C. Tallec, P. H. Richemond, E. Buchatskaya, C. Doersch, B. A. Pires, Z. D. Guo, M. G. Azar, et al. (2020) Bootstrap your own latent: a new approach to self-supervised learning. arXiv preprint arXiv:2006.07733. Cited by: §2.
  • [37] W. Fulton and J. Harris (1991) Representation theory: a first course. External Links: ISBN 9780387974958 Cited by: §1, §2.
  • [38] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick (2019) Momentum contrast for unsupervised visual representation learning. arXiv preprint arXiv:1911.05722. Cited by: Table 13, Table 14, Table 15, §1, §2, Figure 7, §5.3, Table 4.
  • [39] Y. He, Z. Shen, and P. Cui (2021) Towards non-iid image classification: a dataset and baselines. Pattern Recognition 110, pp. 107383. Cited by: §C.4.1, Table 10, Figure 7, §5.3.
  • [40] O. Henaff (2020) Data-efficient image recognition with contrastive predictive coding. In International Conference on Machine Learning, pp. 4182–4192. Cited by: §2.
  • [41] I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner (2017) Beta-vae: learning basic visual concepts with a constrained variational framework. In ICLR, Cited by: §2.
  • [42] I. Higgins, D. Amos, D. Pfau, S. Racaniere, L. Matthey, D. Rezende, and A. Lerchner (2018) Towards a definition of disentangled representations. arXiv preprint arXiv:1812.02230. Cited by: Self-Supervised Learning Disentangled Group Representation as Feature, §1, §2.
  • [43] I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner (2017) Beta-vae: learning basic visual concepts with a constrained variational framework. International conference on learning representations. Cited by: Table 12, §5.1, Table 1.
  • [44] R. D. Hjelm, A. Fedorov, S. Lavoie-Marchildon, K. Grewal, P. Bachman, A. Trischler, and Y. Bengio (2018) Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670. Cited by: §2.
  • [45] J. Hsieh, B. Liu, D. Huang, L. Fei-Fei, and J. C. Niebles (2018) Learning to decompose and disentangle representations for video prediction. In Advances in neural information processing systems, Cited by: §2.
  • [46] Q. Hu, X. Wang, W. Hu, and G. Qi (2020) AdCo: adversarial contrast for efficient learning of unsupervised representations from self-trained negative adversaries. arXiv preprint arXiv:2011.08435. Cited by: §2.
  • [47] A. Ilyas, S. Santurkar, D. Tsipras, L. Engstrom, B. Tran, and A. Madry (2019) Adversarial examples are not bugs, they are features. In Advances in Neural Information Processing Systems, Cited by: §1.
  • [48] A. Jaegle, S. Phillips, D. Ippolito, and K. Daniilidis (2018) Understanding image motion with group representations. Cited by: §2.
  • [49] T. W. Judson (1994) Abstract algebra: theory and applications (the prindle, weber & schmidt series in advanced mathematics). Prindle Weber & Schmidt. Cited by: §2, §4, footnote 1.
  • [50] Y. Kalantidis, M. B. Sariyildiz, N. Pion, P. Weinzaepfel, and D. Larlus (2020) Hard negative mixing for contrastive learning. arXiv preprint arXiv:2010.01028. Cited by: §C.3.1, Table 11, §2, §5.2, Table 2.
  • [51] T. Karaletsos, S. Belongie, and G. Rätsch (2015) Bayesian representation learning with oracle constraints. International conference on learning representations. Cited by: §2.
  • [52] H. Kim and A. Mnih (2018) Disentangling by factorising. In International Conference on Machine Learning, Cited by: Table 12, §2, §5.1, §5.1, Table 1.
  • [53] D. Kingma and M. Welling (2014) Auto-encoding variational bayes. In ICLR, Cited by: Table 12, §5.1, Table 1.
  • [54] S. Kornblith, J. Shlens, and Q. V. Le (2019) Do better imagenet models transfer better?. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Cited by: §5.3, §5.3.
  • [55] A. Krizhevsky, G. Hinton, et al. (2009) Learning multiple layers of features from tiny images. Cited by: §5.2, Table 2.
  • [56] A. Krizhevsky (2012) Learning multiple layers of features from tiny images. Cited by: §5.3.
  • [57] K. Lee, Y. Zhu, K. Sohn, C. Li, J. Shin, and H. Lee (2021) I-mix: a domain-agnostic strategy for contrastive representation learning. In ICLR, Cited by: §C.3.1, §5.2, Table 2.
  • [58] J. Li, P. Zhou, C. Xiong, R. Socher, and S. C. Hoi (2020) Prototypical contrastive learning of unsupervised representations. arXiv preprint arXiv:2005.04966. Cited by: Table 13, Table 14, Table 15, §2, Figure 7, Table 4.
  • [59] Z. Lin, K. Thekumparampil, G. Fanti, and S. Oh (2020) Infogan-cr and modelcentrality: self-supervised model training and selection for disentangling gans. In International Conference on Machine Learning, Cited by: §2.
  • [60] F. Locatello, M. Tschannen, S. Bauer, G. Rätsch, B. Schölkopf, and O. Bachem (2020) Disentangling factors of variations using few labels. In 8th International Conference on Learning Representations (ICLR), Cited by: §2.
  • [61] F. Locatello, S. Bauer, M. Lucic, G. Raetsch, S. Gelly, B. Schölkopf, and O. Bachem (2019) Challenging common assumptions in the unsupervised learning of disentangled representations. In international conference on machine learning, Cited by: §C.2.1, §C.2.1, §2, §5.1.
  • [62] S. Maji, E. Rahtu, J. Kannala, M. Blaschko, and A. Vedaldi (2013) Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151. Cited by: §5.3.
  • [63] I. Misra and L. v. d. Maaten (2020) Self-supervised learning of pretext-invariant representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6707–6717. Cited by: Table 13, Table 14, Table 15, §2, Figure 7, Table 4.
  • [64] M. Nilsback and A. Zisserman (2008) Automated flower classification over a large number of classes. In 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing, Cited by: §5.3.
  • [65] M. Noroozi and P. Favaro (2016) Unsupervised learning of visual representations by solving jigsaw puzzles. In European conference on computer vision, pp. 69–84. Cited by: §2.
  • [66] U. Ojha, K. K. Singh, C. Hsieh, and Y. J. Lee (2020) Elastic-infogan: unsupervised disentangled representation learning in class-imbalanced data. In Advances in neural information processing systems, Cited by: §2.
  • [67] A. v. d. Oord, Y. Li, and O. Vinyals (2018) Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748. Cited by: §2.
  • [68] G. Parascandolo, N. Kilbertus, M. Rojas-Carulla, and B. Schölkopf (2018) Learning independent causal mechanisms. In Proceedings of the 35th International Conference on Machine Learning, pp. 4036–4044. Cited by: §1.
  • [69] O. M. Parkhi, A. Vedaldi, A. Zisserman, and C. Jawahar (2012) Cats and dogs. In 2012 IEEE conference on computer vision and pattern recognition, Cited by: §5.3.
  • [70] R. Quessard, T. Barrett, and W. Clements (2020) Learning disentangled representations and group structure of dynamical environments. Advances in Neural Information Processing Systems. Cited by: §2.
  • [71] R. P. Rao and D. L. Ruderman (1999) Learning lie groups for invariant visual perception. Advances in neural information processing systems. Cited by: §1.
  • [72] S. Reed, K. Sohn, Y. Zhang, and H. Lee (2014) Learning to disentangle factors of variation with manifold interaction. In International conference on machine learning, Cited by: §2.
  • [73] X. Ren, T. Yang, Y. Wang, and W. Zeng (2021) Do generative models know disentanglement? contrastive learning is all you need. arXiv preprint arXiv:2102.10543. Cited by: §2.
  • [74] K. Ridgeway and M. C. Mozer (2018) Learning deep disentangled embeddings with the f-statistic loss. In Advances in neural information processing systems, Cited by: §C.2.1, §2, §5.1.
  • [75] J. Robinson, C. Chuang, S. Sra, and S. Jegelka (2020) Contrastive learning with hard negative samples. arXiv preprint arXiv:2010.04592. Cited by: §2.
  • [76] B. Schölkopf, D. Janzing, J. Peters, E. Sgouritsa, K. Zhang, and J. Mooij (2012) On causal and anticausal learning. In Proceedings of the 29th International Conference on Machine Learning, Cited by: §1.
  • [77] K. Simonyan and A. Zisserman (2015) Very deep convolutional networks for large-scale image recognition. In 3rd International Conference on Learning Representations, Cited by: §C.1, Figure 2.
  • [78] I. Sosnovik, M. Szmaja, and A. Smeulders (2020) Scale-equivariant steerable networks. Cited by: §2.
  • [79] J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller (2014) Striving for simplicity: the all convolutional net. arXiv preprint arXiv:1412.6806. Cited by: §C.1, Figure 2.
  • [80] R. Suter, D. Miladinovic, S. Bauer, and B. Schölkopf (2018) Interventional robustness of deep latent variable models. arXiv. Cited by: §2.
  • [81] R. Suter, D. Miladinovic, B. Schölkopf, and S. Bauer (2019) Robustly disentangled causal mechanisms: validating deep representations for interventional robustness. In International Conference on Machine Learning, Cited by: §C.2.1, §2, §5.1.
  • [82] Y. Tian, D. Krishnan, and P. Isola (2019) Contrastive multiview coding. arXiv preprint arXiv:1906.05849. Cited by: §C.3.1, Table 11, §2, §5.3.
  • [83] Y. Tian, D. Krishnan, and P. Isola (2020) Contrastive multiview coding. In European conference on computer vision, A. Vedaldi, H. Bischof, T. Brox, and J. Frahm (Eds.), Cited by: Figure 2.
  • [84] Y. Tian, C. Sun, B. Poole, D. Krishnan, C. Schmid, and P. Isola (2020) What makes for good views for contrastive learning. arXiv preprint arXiv:2005.10243. Cited by: §2.
  • [85] Y. Tian, X. Chen, and S. Ganguli (2021) Understanding self-supervised learning dynamics without contrastive pairs. arXiv preprint arXiv:2102.06810. Cited by: §2.
  • [86] L. Tran, X. Yin, and X. Liu (2017) Disentangled representation learning gan for pose-invariant face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • [87] L. Van der Maaten and G. Hinton (2008) Visualizing data using t-sne.. Journal of machine learning research 9 (11). Cited by: Figure 4.
  • [88] S. Van Steenkiste, F. Locatello, J. Schmidhuber, and O. Bachem (2019) Are disentangled representations helpful for abstract visual reasoning?. In Advances in neural information processing systems, Cited by: §D.4, §2, §5.3.
  • [89] T. Wang, C. Zhou, Q. Sun, and H. Zhang (2021) Causal attention for unbiased visual recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Cited by: §1.
  • [90] X. Wang, Z. Liu, and S. X. Yu (2020) Unsupervised feature learning by cross-level discrimination between instances and groups. arXiv preprint arXiv:2008.03813. Cited by: §2, §5.2.
  • [91] D. E. Worrall and M. Welling (2019) Deep scale-spaces: equivariance over scale. In NeurIPS, Cited by: §2.
  • [92] Z. Wu, Y. Xiong, S. X. Yu, and D. Lin (2018) Unsupervised feature learning via non-parametric instance discrimination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3733–3742. Cited by: Table 13, Table 14, Table 15, §2, Figure 7, Table 4.
  • [93] J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba (2010) Sun database: large-scale scene recognition from abbey to zoo. In 2010 IEEE computer society conference on computer vision and pattern recognition, Cited by: §5.3.
  • [94] T. Xiao, X. Wang, A. A. Efros, and T. Darrell (2021) What should not be contrastive in contrastive learning. In International Conference on Learning Representations, Cited by: §4.
  • [95] L. Yang, P. Luo, C. C. Loy, and X. Tang (2015) A large-scale car dataset for fine-grained categorization and verification. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §5.3.
  • [96] M. Ye, X. Zhang, P. C. Yuen, and S. Chang (2019) Unsupervised embedding learning via invariant and spreading instance feature. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6210–6219. Cited by: §2.
  • [97] Z. Yue, T. Wang, H. Zhang, Q. Sun, and X. Hua (2021) Counterfactual zero-shot and open-set visual recognition. In CVPR, Cited by: §1, §5.1.
  • [98] Z. Yue, H. Zhang, Q. Sun, and X. Hua (2020) Interventional few-shot learning. In NeurIPS, Cited by: §D.5, item 4.
  • [99] J. Zaidi, J. Boilard, G. Gagnon, and M. Carbonneau (2020) Measuring disentanglement: a review of metrics. arXiv preprint arXiv:2012.09276. Cited by: §C.2.1.
  • [100] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz (2018) Mixup: beyond empirical risk minimization. In ICLR, Cited by: Table 2.
  • [101] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba (2016) Learning deep features for discriminative localization. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2921–2929. Cited by: Figure 7, §5.3.
  • [102] Y. Zhu, M. R. Min, A. Kadav, and H. P. Graf (2020) S3VAE: self-supervised sequential vae for representation disentanglement and data generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Cited by: §2.
  • [103] Z. Zhu, P. Luo, X. Wang, and X. Tang (2014) Multi-view perceptron: a deep model for learning face identity and view representations. Advances in Neural Information Processing Systems 27 (NIPS 2014). Cited by: §2.

Appendix A Preliminaries

A group is a set together with a binary operation, which takes two elements of the set and maps them to another element. For example, the set of integers is a group under addition. We formalize the notion through the following definitions.

Binary Operation. A binary operation $\cdot$ on a set $S$ is a function mapping $S \times S$ into $S$. For each $(a, b) \in S \times S$, we denote the element $\cdot(a, b)$ by $a \cdot b$.

Group. A group $\langle G, \cdot \rangle$ is a set $G$, closed under a binary operation $\cdot$, such that the following axioms hold:

  1. Associativity. $\forall a, b, c \in G$, we have $(a \cdot b) \cdot c = a \cdot (b \cdot c)$.

  2. Identity Element. $\exists e \in G$, such that $\forall a \in G$, $e \cdot a = a \cdot e = a$.

  3. Inverse. $\forall a \in G$, $\exists a' \in G$, such that $a \cdot a' = a' \cdot a = e$.

Groups often arise as transformations of some space, such as a set, vector space, or topological space. Consider an equilateral triangle. The set of clockwise rotations about its centroid that retain its appearance forms a group $\{120^\circ, 240^\circ, 360^\circ\}$, with the last element corresponding to the identity mapping. We say this group of rotations acts on the triangle, which is formally defined below.
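The group axioms can be checked mechanically on the triangle example. The sketch below assumes the rotation set is 120°, 240° and 360° (the rotations that map an equilateral triangle onto itself), with 360° playing the role of the identity. The helper names `compose` and `is_group` are illustrative.

```python
from itertools import product

ROTATIONS = (120, 240, 360)  # degrees; 360 acts as the identity rotation

def compose(a, b):
    """Apply rotation a, then rotation b; the result is reported in (0, 360]."""
    r = (a + b) % 360
    return 360 if r == 0 else r

def is_group(elements, op):
    """Check closure, associativity, identity and inverses by brute force."""
    elements = tuple(elements)
    closed = all(op(a, b) in elements for a, b in product(elements, repeat=2))
    if not closed:
        return False
    associative = all(
        op(op(a, b), c) == op(a, op(b, c))
        for a, b, c in product(elements, repeat=3)
    )
    identities = [e for e in elements
                  if all(op(e, a) == a == op(a, e) for a in elements)]
    if not (associative and identities):
        return False
    e = identities[0]
    # Every element needs a two-sided inverse.
    return all(any(op(a, b) == e == op(b, a) for b in elements)
               for a in elements)

triangle_is_group = is_group(ROTATIONS, compose)      # all four checks pass
subset_is_group = is_group((120, 360), compose)       # fails closure: 120 + 120 = 240
```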

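Group Action. Let $G$ be a group with binary operation $\cdot$ and $S$ be a set. An action of $G$ on $S$ is a map $\pi: G \to \mathrm{Hom}(S, S)$ so that $\pi(e) = \mathrm{id}_S$ and $\pi(g) \circ \pi(h) = \pi(g \cdot h)$, where $g, h \in G$ and $\circ$ denotes functional composition. $\forall g \in G, s \in S$, denote $\pi(g)(s)$ as $g \cdot s$.

In our formulation, we have a group $\mathcal{G}$ acting on the semantic space $\mathcal{S}$. For example, consider the color semantic, which can be mapped to a circle representing the hue. Hence the group acting on it corresponds to rotations, similar to the triangle example, e.g., a group element $g$ may correspond to rotating a color clockwise by a fixed angle. In the context of representation learning, we are interested in learning a feature space $\mathcal{X}$ that reflects $\mathcal{S}$, formally defined below.

Group Representation. Let $G$ be a group. A representation of $G$ (or $G$-representation) is a pair $(\pi, X)$, where $X$ is a vector space and $\pi: G \to \mathrm{Hom}_{\mathrm{vect}}(X, X)$ is a group action, i.e., for each $g \in G$, $\pi(g): X \to X$ is a linear map.

Intuitively, each $g \in G$ corresponds to a linear map, i.e., a matrix $M_g$ that transforms a vector $x \in X$ to $M_g x \in X$. Finally, our definition of disentangled representation involves a decomposition of the semantic space and of the group acting on it. The decomposition of the semantic space is based on the Cartesian product $\times$. A similar concept is defined for groups.

Direct Product of Group. Let $G_1, \ldots, G_n$ be groups with the binary operation $\cdot$. Let $a_i, b_i \in G_i$ for $i \in \{1, \ldots, n\}$. Define $(a_1, \ldots, a_n) \cdot (b_1, \ldots, b_n)$ to be the element $(a_1 \cdot b_1, \ldots, a_n \cdot b_n)$. Then $G_1 \times \ldots \times G_n$ or $\prod_{i=1}^{n} G_i$ is the direct product of the groups $G_i$ under the binary operation $\cdot$.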

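With this, we can formally define that $X_i$ is only affected by the action of $G_i$ and fixed by the action of the other subgroups: $(\pi|_{G_j}, X_i)$ with $j \neq i$ is a trivial sub-representation (“fixed”), i.e., for each $g \in G_j$, $\pi(g)$ restricted to $X_i$ is the identity mapping, and $(\pi|_{G_i}, X_i)$ is non-trivial (“affected”).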
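A small numerical example may help make "fixed" and "affected" concrete. The sketch assumes a toy direct product of two cyclic groups of orders 3 and 4 acting on a 4-dimensional feature space split into two 2-dimensional subspaces, each by planar rotations. The groups, dimensions and the names `rot`, `rho` and `op` are illustrative assumptions, not part of the paper.

```python
import numpy as np

ORDERS = (3, 4)  # toy direct product Z_3 x Z_4

def rot(theta):
    """2x2 rotation matrix."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def op(g, h):
    """Component-wise binary operation of the direct product."""
    return tuple((a + b) % n for a, b, n in zip(g, h, ORDERS))

def rho(g):
    """Block-diagonal representation on R^4 = X_1 (+) X_2.

    The i-th group component rotates only the i-th 2-D block.
    """
    out = np.zeros((4, 4))
    for i, (k, n) in enumerate(zip(g, ORDERS)):
        out[2 * i:2 * i + 2, 2 * i:2 * i + 2] = rot(2 * np.pi * k / n)
    return out

x = np.array([1.0, 2.0, 3.0, 4.0])
y = rho((1, 0)) @ x  # act with an element of the first factor only

first_block_changed = not np.allclose(y[:2], x[:2])   # X_1 is "affected"
second_block_fixed = np.allclose(y[2:], x[2:])        # X_2 is "fixed"
is_homomorphism = np.allclose(rho((1, 3)) @ rho((2, 2)), rho(op((1, 3), (2, 2))))
```

Because the matrix is block-diagonal, each subspace carries a sub-representation, and the off-diagonal zeros are exactly what makes the action of one factor trivial on the other factor's subspace.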

Appendix B Proof

B.1 Proof of Definition 2

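$\mathcal{D}$ Defines a Partition of $\mathcal{X}$. We will show that $\mathcal{D}$ defines an equivalence relation $\sim$ on $\mathcal{X}$, which naturally leads to a partition of $\mathcal{X}$. For $x, x' \in \mathcal{X}$, let $x \sim x'$ if and only if $\exists g \in \mathcal{D}$ such that $x' = g \cdot x$. We show that $\sim$ satisfies the three properties of an equivalence relation. 1) Reflexive: $\forall x \in \mathcal{X}$, we have $e \cdot x = x$, hence $x \sim x$. 2) Symmetric: Suppose $x \sim x'$, i.e., $x' = g \cdot x$ for some $g \in \mathcal{D}$. Then $x = g^{-1} \cdot x'$ with $g^{-1} \in \mathcal{D}$ because $\mathcal{D}$ is a subgroup, i.e., $x' \sim x$. 3) Transitive: if $x \sim x'$ and $x' \sim x''$, then $x' = g_1 \cdot x$ and $x'' = g_2 \cdot x'$ for some $g_1, g_2 \in \mathcal{D}$. Hence $x'' = (g_2 \cdot g_1) \cdot x$ and $x \sim x''$.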

Number of Orbits. Recall that acts transitively on (see Section 4). We consider the non-trivial case where the action of is faithful, i.e., the only group element that maps all to itself is the identity element . Let . We will show that each corresponds to a unique orbit. 1) , . Suppose , such that for some , the action of on each corresponds to the identity mapping. One can show that for every different orbit, i.e., , the action of on each is also the identity mapping. As partitions into orbits w.r.t. , this means that the action of is the identity mapping on all , which contradicts the action of being faithful. 2) The previous step shows that non-identity group elements in lead to a different orbit. We need to further show that these orbits are unique, i.e., , if , then . Suppose , i.e., , so , where is the point stabilizer of . As the action of is faithful, . Hence implies .
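The two claims of this proof, that the orbits of a subgroup partition the set and that their number matches the number of cosets, can be illustrated on a toy example. The sketch assumes the cyclic group of order 6 acting on itself by addition, with the subgroup {0, 3}. These choices and the helper name `orbits` are illustrative and not taken from the paper.

```python
def orbits(points, subgroup, act):
    """Group the points into orbits under the action of the subgroup."""
    seen, result = set(), []
    for x in points:
        if x in seen:
            continue
        orbit = frozenset(act(g, x) for g in subgroup)
        result.append(orbit)
        seen |= orbit
    return result

GROUP_ORDER = 6
POINTS = range(GROUP_ORDER)   # the group acts on itself (transitive, faithful)
SUBGROUP = (0, 3)             # a subgroup of order 2

def act(g, x):
    return (g + x) % GROUP_ORDER

orbit_list = orbits(POINTS, SUBGROUP, act)

# Partition check: the orbits are disjoint and cover every point.
covers_all = set().union(*orbit_list) == set(POINTS)
disjoint = sum(len(o) for o in orbit_list) == GROUP_ORDER
# Number of orbits equals the number of cosets, |G| / |D|.
matches_cosets = len(orbit_list) == GROUP_ORDER // len(SUBGROUP)
```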

B.2 Details of Lemma 1

We will first prove Lemma 1 by showing the representation is -equivariant, followed by showing that and are decomposable and finally showing that is not decomposable. We will then present more details on the 4 corollaries.

Proof of -equivariant. Suppose that the training loss is minimized, yet for . Let in the denominator, and we have , where is the angle between the two vectors. When , . So keeping constant (i.e., the same regularization penalty such as L2), can be further reduced if , which reduces the training loss. This contradicts the earlier assumption. Hence by minimizing the training loss, we can achieve sample-equivariance, i.e., different samples have different features. Note that this does not necessarily mean group-equivariance. However, the variation of training samples is all we know about the group action of , and we established that the action of is transitive on , hence we use the sample-equivariant features as an approximation of -equivariant features.

Proof of Decomposability between and . Recall the semantic representation , which is shown to be -equivariant in the previous step. Consider a non-decomposable representation where is affected by the action of both and . Let , where both sub-spaces are affected by the action of the two groups. In particular, denote the semantic representation , where is affected by the action of (recall that affects and through the equivariant map in Figure 1) and is affected by the action of . From here, we will construct a representation where is only affected by the action of with a lower training loss.

Specifically, we aim to assign a to the -th orbit, which is given by:


where is the value of for -th orbit. Now define given by . Using this new has two outcomes:

1) in the numerator is the linear combination of the dot similarity induced from and . And the dot similarity induced from is increased, as inside each orbit, the value in is the same (maximized similarity);

2) The denominator is now reduced. This is because the denominator is proportional to , and we have already selected the best set that minimizes the expected dot similarities across orbits.

As the in-orbit dot similarity increases (numerator), and the cross-orbit dot similarity decreases (denominator), the training loss is reduced by decomposing a separate sub-space affected only by the action of with . Furthermore, note that a linear projector is used in SSL to project the features into lower dimensions, and a linear weight is used in SL. To isolate the effect of to maximize the similarity of in-orbit samples (numerator) and exploit the action of to minimize the similarity of cross-orbit samples (denominator), the effect of and on must be separable by a linear layer, i.e., decomposable. Combined with the earlier proof that is only affected by the action of , without loss of generality, we have the decomposition affected by and , respectively.

Proof of Non-Decomposability of . We will show that for a representation with decomposed, there exists a non-decomposable representation that achieves the same expected dot similarity, hence having the same contrastive loss. Without loss of generality, consider acting on the semantic attribute space , respectively. Let be a decomposable representation such that there exists feature subspaces affected only by the action of , respectively. Denote . Now we define a non-decomposable representation with mapping and , given by and . Now for any pair of samples with semantic and , and , the dot similarity induced from the subspace is given by . Therefore, the decomposed and non-decomposed representations yield the same expected dot similarity over all pairs of samples, and have the same contrastive loss.
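The key fact used above, that a representation mixing two subspaces can reproduce every pairwise dot similarity of a decomposed one, can be illustrated with a minimal sketch. The mixing map here is a plain 2-D rotation, chosen as an assumption for illustration; it is not necessarily the construction used in the proof, whose formula is not reproduced in this text. The names `mix`, `decomposed`, and `entangled` are hypothetical.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def mix(z, angle=math.pi / 4):
    """Rotate a 2-D feature so that each output coordinate depends on both inputs."""
    c, s = math.cos(angle), math.sin(angle)
    z1, z2 = z
    return [c * z1 - s * z2, s * z1 + c * z2]


random.seed(0)
# Decomposed features: coordinate 0 and coordinate 1 stand for two separate subspaces.
decomposed = [[random.gauss(0.0, 1.0), random.gauss(0.0, 1.0)] for _ in range(5)]
# Entangled features: every coordinate is now a blend of both subspaces.
entangled = [mix(z) for z in decomposed]

# Largest difference between pairwise dot similarities of the two representations.
max_gap = max(
    abs(dot(decomposed[i], decomposed[j]) - dot(entangled[i], entangled[j]))
    for i in range(5)
    for j in range(5)
)
print(max_gap)
```

Because all pairwise dot similarities agree up to floating-point error, any loss that depends on the features only through these similarities cannot tell the two representations apart.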

Corollary 1. This follows immediately from the proof on -equivariance.

Corollary 2. The same proof above holds for the SL case with , where is the set of classifier weights. In this view, each sample in the class can be seen as an augmented view (augmented by shared attributes such as view angle, pose, etc.) of the class prototype. In downstream learning, the shared attributes are not discriminative, hence the performance is affected mostly by . For example, if the groups corresponding to “species” and “shape” act on the same feature subspace (entangled), such that “species”=“bird” always has the “shape”=“streamlined” feature, this representation does not generalize to downstream tasks of classifying birds without a streamlined shape (e.g., “kiwi”).

Corollary 3. In SL and SSL, the model essentially receives supervision on attributes that are not discriminative towards downstream tasks, through augmentations and in-class variations, respectively. The group acts on the semantic space of these attributes, hence determines the amount of supervision received. With a large , the model filters off more irrelevant semantics and more accurately describes the differences between classes. Note that the standard image augmentations in SSL are also used in SL, making even larger in SL.

Corollary 4. When the number of samples in some orbit(s) is smaller than , this has two consequences that prevent disentanglement: 1) The -equivariance is not guaranteed as the training samples do not fully describe . 2) The decomposability is not guaranteed as the decomposed in the previous proof only generalizes to the seen combinations of values in .

B.3 Proof of Theorem 1

We will first revisit the Invariant Risk Minimization (IRM). Let be the image space, the feature space, the classification output space (e.g., the set of all probabilities of belonging to each class), the feature extractor backbone and the classifier . Let be a set of training environments, where each is a set of images. IRM aims to solve the following optimization problem:


where is the empirical classification risk in the environment using backbone and classifier . Conceptually, IRM aims to find a representation such that the optimal classifier on top of is the same for all environments. As Eq. (5) is a challenging bi-level optimization problem, it is instantiated into the practical version:


where is the regularizer balancing between the ERM term and invariant term.
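As a rough illustration of the practical version, the sketch below follows the widely used IRMv1 recipe: a fixed scalar classifier `w = 1.0` sits on top of the feature output, and the invariance term is the squared gradient of each environment's risk with respect to `w`. The binary logistic risk, the labels in {-1, +1}, and all function and variable names are assumptions made for this example; the paper's own equation is not reproduced in this text.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def logistic_risk(scores, labels, w=1.0):
    """Mean logistic risk of the scalar classifier w applied to feature scores."""
    n = len(scores)
    return sum(math.log(1.0 + math.exp(-y * w * s)) for s, y in zip(scores, labels)) / n


def irm_penalty(scores, labels):
    """Squared gradient of the risk with respect to w, evaluated at w = 1.0."""
    n = len(scores)
    grad = sum(-y * s * sigmoid(-y * s) for s, y in zip(scores, labels)) / n
    return grad * grad


def irm_v1_objective(environments, lam):
    """Sum over environments of (ERM term + lam * invariance penalty)."""
    total = 0.0
    for scores, labels in environments:
        total += logistic_risk(scores, labels) + lam * irm_penalty(scores, labels)
    return total


# Two toy environments: (feature scores, labels in {-1, +1}); values are made up.
env_a = ([2.0, -1.5, 0.5], [1, -1, 1])
env_b = ([0.3, -0.2, 1.0], [1, -1, -1])

erm_only = irm_v1_objective([env_a, env_b], lam=0.0)
with_penalty = irm_v1_objective([env_a, env_b], lam=10.0)
print(erm_only, with_penalty)
```

A penalty of zero in an environment means that rescaling the classifier cannot lower that environment's risk, which is the sense in which `w = 1.0` is simultaneously optimal across environments.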

The above IRM is formulated for supervised training. In SSL, there is no classifier mapping from . Instead, there is a projector network mapping features to another feature space , and Eq. (1) is used to compute the similarity with positive key (numerator) and negative keys (denominator) in . Note that in SSL is not equivalent to in SL, as itself does not generate the probability output like , rather, the comparison between positive and negative keys does.

In fact, the formulation of contrastive IRM is given by Corollary 2 of Lemma 1, which says that SL is a special case of contrastive learning, and the set of all classifier weights is the positive and negative key space. In IRM with SL, we are trying to find a set of weights from the classifier weights space (e.g., with feature dimension as and number of classes as ) that achieves invariant prediction. Hence in IRM with SSL, we are trying to find a set of keys from the key space (e.g., with being the dimension of and number of positive and negative keys as ) that achieves invariant prediction by differentiating a sample with negative keys (Note that the similarity with positive keys is maximized and fixed using standard SSL training by decomposing augmentations and other semantics as in Lemma 1). Specifically, in IP-IRM, the 2 subsets in each partition form the set of training environments .
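A contrastive counterpart can be sketched in the same spirit, treating the two subsets of a partition as the two environments. The sketch assumes that the dummy classifier is a scalar `scale = 1.0` multiplying the similarity logits, and that each sample is compared only against keys from its own subset. These choices, the precomputed logits, and all names below are illustrative assumptions, not the authors' released implementation.

```python
import math


def scaled_info_nce(pos_logit, neg_logits, scale=1.0):
    """Contrastive loss for one sample, with a dummy scalar multiplying all logits."""
    logits = [scale * pos_logit] + [scale * l for l in neg_logits]
    m = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[0]


def scale_gradient(pos_logit, neg_logits):
    """d(loss)/d(scale) at scale = 1.0: softmax-weighted mean logit minus positive logit."""
    logits = [pos_logit] + list(neg_logits)
    m = max(logits)
    weights = [math.exp(l - m) for l in logits]
    expected = sum(w * l for w, l in zip(weights, logits)) / sum(weights)
    return expected - pos_logit


def subset_objective(subset, lam):
    """Mean contrastive loss in one subset plus lam times the squared scale gradient."""
    n = len(subset)
    loss = sum(scaled_info_nce(p, negs) for p, negs in subset) / n
    grad = sum(scale_gradient(p, negs) for p, negs in subset) / n
    return loss + lam * grad * grad


def partition_objective(subset_1, subset_2, lam):
    """The two subsets of a partition play the role of the training environments."""
    return subset_objective(subset_1, lam) + subset_objective(subset_2, lam)


# Each sample: (positive logit, negative logits against keys in the same subset).
subset_1 = [(2.0, [0.5, -0.3]), (1.5, [0.2, 0.1])]
subset_2 = [(1.0, [0.9, 0.8]), (0.7, [1.2, -0.5])]

objective_plain = partition_objective(subset_1, subset_2, lam=0.0)
objective_invariant = partition_objective(subset_1, subset_2, lam=5.0)
print(objective_plain, objective_invariant)
```

Under these assumptions the penalty is small only when the same scale is close to optimal in both subsets, so the representation cannot rely on similarity patterns that differ between them.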

Proof of the Sufficient Condition. Suppose that the representation is fully disentangled w.r.t. . By Definition 1, there exists subspace affected only by the action of . For each partition given by