1 Introduction
Deep learning is all about learning feature representations [4]. Compared to conventional end-to-end supervised learning, Self-Supervised Learning (SSL) first learns a generic feature representation (e.g., a network backbone) by training with unsupervised pretext tasks such as the prevailing contrastive objective [38, 17], and then this stage-1 feature is expected to serve various stage-2 applications with proper fine-tuning. SSL for visual representation is fascinating because, for the first time, we can obtain "good" visual features for free, just like the trending pre-training in the NLP community [27, 9]. However, most SSL works only care about how much stage-2 performance an SSL feature can improve, and overlook what feature SSL is learning, why it can be learned, what cannot be learned, what the gap between SSL and Supervised Learning (SL) is, and when SSL can surpass SL.
The crux of answering these questions is to formally understand what a feature representation is and what a good one is. We postulate the classic world model of visual generation and feature representation [1, 71] as in Figure 1. Let $\mathcal{C}$ be a set of (unseen) semantics, e.g., attributes such as "digit" and "color". There is a set of independent and causal mechanisms [68] $\varphi: \mathcal{C} \to \mathcal{I}$, generating images from semantics, e.g., writing a digit "0" when thinking of "0" [76]. A visual representation is the inference process $\phi: \mathcal{I} \to \mathcal{X}$ that maps image pixels to vector-space features, e.g., a neural network. We define the semantic representation as the functional composition $f = \phi \circ \varphi: \mathcal{C} \to \mathcal{X}$. In this paper, we are only interested in the parameterization of the inference process for feature extraction, but not the generation process, i.e., we assume that for each image $I \in \mathcal{I}$ there is a semantic $c \in \mathcal{C}$ such that $I = \varphi(c)$ is fixed as the observation of the image sample. Therefore, we consider semantic and visual representations the same as feature representation, or simply representation, and we slightly abuse the notation, i.e., $f$ and $\phi$ share the same trainable parameters. We call the vector $\boldsymbol{x} = f(c)$ the feature, where $\boldsymbol{x} \in \mathcal{X}$.

We propose to use Higgins' definition of disentangled representation [42] to define what is "good".
Definition 1. (Disentangled Representation) Let $G$ be the group acting on $\mathcal{C}$, i.e., $g \cdot c$ transforms $c \in \mathcal{C}$, e.g., a "turn green" group element changing the semantic from "red" to "green". Suppose there is a direct product decomposition $G = G_1 \times \ldots \times G_m$ and $\mathcal{C} = \mathcal{C}_1 \times \ldots \times \mathcal{C}_m$, where $G_i$ acts on $\mathcal{C}_i$ respectively. (Note that $G_i$ can also denote a cyclic subgroup such as rotation, or a countable one treated as cyclic, such as translation and color.) A feature representation is disentangled if there exists a group $G$ acting on $\mathcal{X}$ such that:
- Equivariant: $f(g \cdot c) = g \cdot f(c)$ for all $g \in G$, $c \in \mathcal{C}$, e.g., the feature of the changed semantic, "red" to "green" in $\mathcal{C}$, is equivalent to directly changing the color vector in $\mathcal{X}$ from "red" to "green".
- Decomposable: there is a decomposition $\mathcal{X} = \mathcal{X}_1 \times \ldots \times \mathcal{X}_m$, such that each $\mathcal{X}_i$ is fixed by the action of all $G_j$, $j \neq i$, and affected only by $G_i$, e.g., changing the "color" semantic in $\mathcal{C}$ does not affect the "digit" vector in $\mathcal{X}$.
Compared to the previous definition of feature representation, which is a static mapping, the disentangled representation in Definition 1 is dynamic, as it explicitly incorporates group representation [37]: a homomorphism from a group to its group actions on a space, e.g., from $G$ to the actions on $\mathcal{X}$, and it is common to use the feature space as a shorthand, which is where our title stands.
Definition 1 captures "good" features in the common views: 1) Robustness: a good feature should be invariant to the change of environmental semantics, such as external interventions [47, 89] or domain shifts [34]. By the above definition, such a change is always retained in one subspace $\mathcal{X}_i$, while the others are not affected. Hence, the subsequent classifier will focus on the invariant features and ignore the ever-changing $\mathcal{X}_i$. 2) Zero-shot Generalization: even if a new combination of semantics is unseen in training, each semantic has been learned as a feature. So, the metric of each $\mathcal{X}_i$ trained on seen samples remains valid for unseen samples [97].

Are the existing SSL methods learning disentangled representations? No. We show in Section 4 that they can only disentangle representations according to the hand-crafted augmentations, e.g., color jitter and rotation. For example, in Figure 2 (a), even if we only use the augmentation-related feature, the classification accuracy of a standard SSL method (SimCLR [17]) does not drop much compared to using the full feature. Figure 2 (b) visualizes that the CNN features in each layer are indeed entangled (e.g., tyre, motor, and background in the motorcycle image). In contrast, our approach IP-IRM, to be introduced below, disentangles more useful features beyond augmentations.
In this paper, we propose Iterative Partition-based Invariant Risk Minimization (IP-IRM, pronounced [aɪ pɜːm]) that guarantees to learn disentangled representations in an SSL fashion. We present the algorithm in Section 3, followed by the theoretical justifications in Section 4. In a nutshell, at each iteration, IP-IRM first partitions the training data into two disjoint subsets, each of which is an orbit of the already disentangled group, and the cross-orbit group action corresponds to an entangled group element $g$. Then, we adopt Invariant Risk Minimization (IRM) [2] to implement a partition-based SSL, which disentangles the representation w.r.t. $g$. Iterating the above two steps eventually converges to a fully disentangled representation w.r.t. $G = G_1 \times \ldots \times G_m$. In Section 5, we show promising experimental results on various feature disentanglement and SSL benchmarks.
2 Related Work
Self-Supervised Learning. SSL aims to learn representations from unlabeled data with hand-crafted pretext tasks [29, 65, 35]. Recently, contrastive learning [67, 63, 40, 82, 17] prevails in most state-of-the-art methods. The key is to map positive samples closer, while pushing apart negative ones in the feature space. Specifically, the positive samples are the augmented views [84, 3, 96, 44] of each instance and the negative ones are other instances. Along this direction, follow-up methods are mainly four-fold: 1) Memory bank [92, 63, 38, 19]: storing the prototypes of all the instances computed previously in a memory bank to benefit from a large number of negative samples. 2) Siamese networks [8]: avoiding representation collapse [36, 20, 85]. 3) Clustering: assigning clusters to samples to integrate inter-instance similarity into contrastive learning [12, 13, 14, 90, 58]. 4) Hard negatives: seeking hard negative samples with adversarial training or better sampling strategies [75, 21, 46, 50]. In contrast, our proposed IP-IRM jumps out of the above frame and introduces disentangled representation into SSL with group theory, showing the limitations of existing SSL and how to break through them.
Disentangled Representation. This notion dates back to [5] and has henceforth become a high-level goal of separating the factors of variation in the data [86, 81, 88, 60]. Several works aim to provide a more precise description [28, 30, 74] by adopting an information-theoretic view [18, 28] and measuring the properties of a disentangled representation explicitly [30, 74]. We adopt the recent group-theoretic definition from Higgins et al. [42], which not only unifies the existing definitions, but also resolves previously controversial points [80, 61]. Although supervised learning of disentangled representations is a well-studied field [103, 45, 11, 72, 51], unsupervised disentanglement based on GANs [18, 66, 59, 73] or VAEs [41, 16, 102, 52] is still believed to be theoretically challenging [61]. Thanks to Higgins' definition, we prove that the proposed IP-IRM converges with full-semantic disentanglement using group representation theory. Notably, IP-IRM learns a disentangled representation with an inference process, without using generative models as in all the existing unsupervised methods, making IP-IRM applicable even on large-scale datasets.
Group Representation Learning. A group representation has two elements [49, 37]: 1) a homomorphism (e.g., a mapping function) from the group to its group action acting on a vector space, and 2) the vector space. Usually, when there is no ambiguity, we can use either element as the definition. Most existing works focus on learning the first element: they first define the group of interest, such as spherical rotations [24] or image scaling [91, 78], and then learn the parameters of the group actions [25, 48, 70]. In contrast, we focus on the second element; more specifically, we are interested in learning a map between two vector spaces: the image pixel space and the feature vector space. Our representation learning is flexible because it delays the group action learning to downstream tasks on demand. For example, in a classification task, a classifier can be seen as a group action that is invariant to class-agnostic groups but equivariant to class-specific groups (see Section 4).
3 IP-IRM Algorithm
Notations. Our goal is to learn the feature extractor $\phi$ in a self-supervised fashion. We define a partition matrix $\boldsymbol{P} \in \{0, 1\}^{N \times 2}$ that partitions the $N$ training images into 2 disjoint subsets: $P_{i,k} = 1$ if the $i$-th image belongs to the $k$-th subset and $P_{i,k} = 0$ otherwise. Suppose we have a pretext task loss $\mathcal{L}(\phi, \theta, k)$ defined on the samples in the $k$-th subset, where $\theta$ is a "dummy" parameter used to evaluate the invariance of the SSL loss across the subsets (later discussed in Step 1). For example, $\mathcal{L}$ can be defined as the contrastive loss:

$$\mathcal{L}(\phi, \theta, k) = \sum_{\boldsymbol{x} \in \mathcal{X}_k} -\log \frac{\exp(\theta \cdot \boldsymbol{x}^\top \boldsymbol{x}^* / \tau)}{\sum_{\boldsymbol{x}' \in \mathcal{X}_k} \exp(\theta \cdot \boldsymbol{x}^\top \boldsymbol{x}' / \tau)}, \tag{1}$$

where $\mathcal{X}_k$ is the feature set of the $k$-th subset, $\tau$ is a temperature hyper-parameter, and $\boldsymbol{x}^*$ is the augmented-view feature of $\boldsymbol{x}$.
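As a concrete reference, the per-subset contrastive objective above can be sketched in NumPy. This is a minimal illustration assuming l2-normalized features and a temperature `tau`; the function name `info_nce` is ours, not the paper's:

```python
import numpy as np

def info_nce(z, z_aug, tau=0.5):
    """Per-subset contrastive loss: each (z[i], z_aug[i]) is a positive
    pair; all other augmented views in the subset act as negatives.
    z, z_aug: (n, d) arrays of l2-normalized features."""
    sim = np.exp(z @ z_aug.T / tau)      # pairwise similarity logits
    pos = np.diag(sim)                   # positive-pair terms
    return float(np.mean(-np.log(pos / sim.sum(axis=1))))
```

Splitting a training batch by the partition matrix and evaluating this loss on each subset separately yields the $k$-indexed losses used in Eqs. (2) and (3) below.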
Input. $N$ training images. Randomly initialized $\phi$. A partition matrix $\boldsymbol{P}$ initialized such that the first column of $\boldsymbol{P}$ is all 1, i.e., all samples belong to the first subset. Set $\mathcal{P} = \{\boldsymbol{P}\}$.
Output. Disentangled feature extractor $\phi$.
Step 1 [Update $\phi$]. We update $\phi$ by:

$$\phi^* = \arg\min_{\phi} \sum_{\boldsymbol{P} \in \mathcal{P}} \sum_{k=1}^{2} \left[ \mathcal{L}(\phi, \theta{=}1, k) + \lambda_1 \left\| \nabla_{\theta} \mathcal{L}(\phi, \theta, k) \big|_{\theta=1} \right\|^2 \right], \tag{2}$$

where $\lambda_1$ is a hyper-parameter. The second term delineates how far the contrast in one subset is from the constant baseline $\theta = 1$. Minimizing both terms encourages $\mathcal{L}$ in different subsets to stay close to the same baseline, i.e., to be invariant across the subsets. See IRM [2] for more details. In particular, the first iteration corresponds to standard SSL, with $\mathcal{X}_1$ in Eq. (1) containing all training images.
Step 2 [Update $\boldsymbol{P}$]. We fix $\phi$ and find a new partition $\boldsymbol{P}^*$ by

$$\boldsymbol{P}^* = \arg\max_{\boldsymbol{P}} \sum_{k=1}^{2} \left[ \mathcal{L}(\phi, \theta{=}1, k) + \lambda_2 \left\| \nabla_{\theta} \mathcal{L}(\phi, \theta, k) \big|_{\theta=1} \right\|^2 \right], \tag{3}$$

where $\lambda_2$ is a hyper-parameter. In practice, we use a continuous partition matrix in $[0, 1]^{N \times 2}$ during optimization and then threshold it to $\{0, 1\}^{N \times 2}$.
We update $\mathcal{P} \leftarrow \mathcal{P} \cup \{\boldsymbol{P}^*\}$ and iterate the above two steps until convergence.
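The two steps can be sketched end-to-end in NumPy. This is an illustrative toy version under simplifying assumptions: features are precomputed and fixed (the paper updates the backbone by SGD in Step 1), the gradient w.r.t. the dummy parameter $\theta$ is taken by finite differences rather than autograd, and Step 2 uses random search instead of optimizing a continuous partition matrix. All function names are ours:

```python
import numpy as np

def subset_loss(z, z_aug, w=1.0, tau=0.5):
    """Contrastive loss on one subset, with logits scaled by the
    dummy parameter w (w = 1 recovers the plain SSL loss)."""
    sim = np.exp(w * (z @ z_aug.T) / tau)
    return float(np.mean(-np.log(np.diag(sim) / sim.sum(axis=1))))

def irm_terms(z, z_aug, p, eps=1e-4):
    """Summed per-subset loss and squared dummy-gradient penalty for
    a binary partition p in {0, 1}^n (central finite difference)."""
    loss, penalty = 0.0, 0.0
    for k in (0, 1):
        zk, zak = z[p == k], z_aug[p == k]
        loss += subset_loss(zk, zak)
        g = (subset_loss(zk, zak, 1 + eps)
             - subset_loss(zk, zak, 1 - eps)) / (2 * eps)
        penalty += g ** 2
    return loss, penalty

def step2_find_partition(z, z_aug, lam=1.0, n_restarts=50, seed=0):
    """Step 2 (Eq. 3): search for the binary partition maximizing
    loss + lam * penalty, i.e., where the SSL loss is least invariant."""
    rng = np.random.default_rng(seed)
    best_p, best_val = None, -np.inf
    for _ in range(n_restarts):
        p = rng.integers(0, 2, size=len(z))
        if 0 < p.sum() < len(z):          # keep both subsets non-empty
            loss, pen = irm_terms(z, z_aug, p)
            if loss + lam * pen > best_val:
                best_val, best_p = loss + lam * pen, p
    return best_p
```

Step 1 would then minimize, over the backbone parameters, the sum of `irm_terms` objectives across all partitions collected so far; alternating the two steps is the IP-IRM loop.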
4 Justification
Recall that IP-IRM uses training sample partitions to learn the disentangled representations w.r.t. $G = G_1 \times \ldots \times G_m$. As we have an equivariant feature map between the sample space and the feature space (the equivariance is later guaranteed by Lemma 1), we slightly abuse the notation by using $\mathcal{X}$ to denote both spaces. Also, we assume that $\mathcal{X}$ is a homogeneous space of $G$, i.e., any sample $\boldsymbol{x}'$ can be transited from another sample $\boldsymbol{x}$ by a group action: $\boldsymbol{x}' = g \cdot \boldsymbol{x}$. Intuitively, $G$ is all you need to describe the diversity of the training set. It is worth noting that $g$ is any group element in $G$, while $G_i$ is a Cartesian "building block" of $G$, e.g., $G$ can be decomposed by $G = G_1 \times \ldots \times G_m$.
We show that partition and group are tightly connected by the concept of orbit. Given a sample $\boldsymbol{x}$, its group orbit w.r.t. a subgroup $D \subset G$ is the sample set $D(\boldsymbol{x}) = \{g \cdot \boldsymbol{x} \mid g \in D\}$. As shown in Figure 3 (a), if $D$ is a set of attributes shared by classes, e.g., "color" and "pose", the orbit is the sample set of the class of $\boldsymbol{x}$; in Figure 3 (b), if $D$ denotes augmentations, the orbit is the set of augmented images. In particular, we can see that the disjoint orbits in Figure 3 naturally form a partition. Formally, we have the following definition:
Definition 2. (Orbit & Partition [49]) Given a subgroup $D \subset G$, it partitions $\mathcal{X}$ into the disjoint subsets $\{D(\boldsymbol{x}_1), \ldots, D(\boldsymbol{x}_m)\}$, where $m$ is the number of cosets of $D$, and the cosets form a factor group $G/D$. (If $D$ is a normal subgroup of $G$, the cosets indeed form the factor group $G/D$ [49]; we write $G/D$ with slight abuse of notation.) In particular, $\boldsymbol{x}_i$ can be considered as a sample of the $i$-th class, transited from any sample $\boldsymbol{x}$.
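Definition 2 can be made concrete with a toy finite group. The sketch below (our own example, not from the paper) takes $G = \mathbb{Z}_6$ under addition mod 6 and the subgroup $D = \{0, 3\}$: the orbits of $D$ partition the six elements into three disjoint cosets, matching the factor group $\mathbb{Z}_6 / D \cong \mathbb{Z}_3$:

```python
def orbits(elements, subgroup, action):
    """Partition `elements` into disjoint orbits under `subgroup`,
    where action(d, x) applies group element d to x."""
    remaining, parts = set(elements), []
    while remaining:
        x = min(remaining)                        # pick a representative
        orbit = {action(d, x) for d in subgroup}  # D(x)
        parts.append(sorted(orbit))
        remaining -= orbit
    return parts

# G = Z_6 under addition mod 6, subgroup D = {0, 3}
coset_partition = orbits(range(6), [0, 3], lambda d, x: (d + x) % 6)
print(coset_partition)  # [[0, 3], [1, 4], [2, 5]]: three cosets, G/D isomorphic to Z_3
```

The same routine applied to image features, with `subgroup` the augmentation group, would recover exactly the SSL partition of Figure 3 (b): one orbit per instance.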
Interestingly, the partition offers a new perspective on the training data format in Supervised Learning (SL) and Self-Supervised Learning (SSL). In SL, as shown in Figure 3 (a), the data is labeled with classes, each of which is an orbit $D(\boldsymbol{x}_i)$ of training samples, whose variations are depicted by the class-sharing attribute group $D$. The cross-orbit group action, e.g., $g \cdot \boldsymbol{x}$ with $g \in G/D$, can be read as "turn $\boldsymbol{x}$ into a dog", and such a "turn" is always valid due to the assumption that $\mathcal{X}$ is a homogeneous space of $G$. In SSL, as shown in Figure 3 (b), each training sample is augmented by the group $D$. So, $D(\boldsymbol{x}_i)$ consists of all the augmentations of the $i$-th sample, where the cross-orbit group action can be read as "turn $\boldsymbol{x}$ into the $i$-th sample".
Thanks to the orbit and partition view of training data, we are ready to revisit model generalization in a group-theoretic view by using invariance and equivariance, the two sides of the coin whose name is disentanglement. For SL, we expect that a good feature is disentangled into a class-agnostic part and a class-specific part: the former (latter) is invariant (equivariant) to the cross-orbit traverse $G/D$, but equivariant (invariant) to the in-orbit traverse $D$. By using such a feature, a model can generalize to diverse testing samples (limited to $D$ variations) by only keeping the class-specific feature. Formally, we prove that we can achieve such disentanglement by contrastive learning:
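The invariance/equivariance expectation above can be written compactly. In the following paraphrase (ours, not the paper's formal statement), let $D$ be the in-orbit attribute group, $G/D$ the cross-orbit factor group, and suppose the feature hypothetically splits as $f = [f_{\mathrm{spec}}, f_{\mathrm{agn}}]$ into class-specific and class-agnostic parts:

$$f_{\mathrm{spec}}(g \cdot \boldsymbol{x}) = g \cdot f_{\mathrm{spec}}(\boldsymbol{x}), \qquad f_{\mathrm{spec}}(d \cdot \boldsymbol{x}) = f_{\mathrm{spec}}(\boldsymbol{x}), \qquad \forall g \in G/D,\; d \in D,$$

and symmetrically for $f_{\mathrm{agn}}$ with the roles of $D$ and $G/D$ swapped. A classifier that keeps only $f_{\mathrm{spec}}$ is then unaffected by in-orbit (class-sharing) variations.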
Lemma 1. (Disentanglement by Contrastive Learning) Training with the contrastive loss in Eq. (1) disentangles $\mathcal{X}$ w.r.t. $D \times (G/D)$, where the positive pair $(\boldsymbol{x}, \boldsymbol{x}^*)$ is from the same orbit.
We can draw the following interesting corollaries from Lemma 1 (details in Appendix):

1. If we use all the samples in the denominator of the loss, we can approximate equivariant features given limited training samples. This is because the loss minimization guarantees a one-to-one correspondence between sample pairs and group elements, i.e., any sample pair corresponds to a group action.
2. The conventional cross-entropy loss in SL is a special case, if we define $\boldsymbol{x}^*$ as the classifier weights. So, SL does not guarantee the disentanglement of $D$, which causes a generalization error if the class domain of the downstream task is different from that of SL pre-training, e.g., a subset of $G/D$.
3. In contrastive learning based SSL, $D$ is the augmentation group (recall Figure 2), and the number of augmentations is generally much smaller than the class-wise sample diversity in SL. This enables the SL model to generalize to more diverse testing samples (a larger $D$) by filtering out the class-agnostic features (e.g., background) and focusing on the class-specific ones (e.g., foreground), which explains why SSL is worse than SL in downstream classification.
4. In SL, if the number of training samples per orbit is not enough, i.e., smaller than $|D|$, the disentanglement between $D$ and $G/D$ cannot be guaranteed, which underlies the challenges in few-shot learning [98]. Fortunately, in SSL, the number is enough as we always include all the augmented samples in training. Moreover, we conjecture that the augmentation group only contains simple cyclic group elements such as rotation and colorization, which are easier for representation learning.
Lemma 1 does not guarantee the decomposability of each $G_i$. Nonetheless, the downstream model can still generalize by keeping the class-specific features affected by $G/D$. Therefore, the key to filling the gap, or even letting SSL surpass SL, is to achieve the full disentanglement of $G = G_1 \times \ldots \times G_m$.
Theorem 1. The representation is fully disentangled w.r.t. $G$ if and only if, $\forall g \in G$, the contrastive loss in Eq. (1) is invariant across the 2 orbits of the partition $\boldsymbol{P}_g$ induced by $g$, where the two orbits are $D(\boldsymbol{x})$ and $D(g \cdot \boldsymbol{x})$.
The maximization in Step 2 is based on the contraposition of the sufficient condition of Theorem 1. Denote the currently disentangled group as $D$ (initially, $D$ is the augmentation group). If we can find a partition that maximizes the loss in Eq. (3), i.e., the SSL loss is variant across the orbits, then there exists $g \notin D$ such that the representation w.r.t. $g$ is entangled. Figure 3 (c) illustrates a discovered partition about color. The minimization in Step 1 is based on the necessary condition of Theorem 1: based on the discovered $\boldsymbol{P}^*$, if we minimize Eq. (2), we can further disentangle the representation and update $D$. Overall, IP-IRM converges as $G$ is finite. Note that an improved contrastive objective [94] can further disentangle each $G_i$ and achieve full disentanglement w.r.t. $G$.
5 Experiments
5.1 Unsupervised Disentanglement
Datasets. We used two datasets. CMNIST [2] has 60,000 digit images with semantic labels of digits (0-9) and colors (red and green). These images differ in other semantics (e.g., slant and font) that are not labeled. Moreover, there is a strong correlation between digits and colors (most 0-4 in red and 5-9 in green), increasing the difficulty of disentangling them. Shapes3D [52] contains 480,000 images with 6 labelled semantics, i.e., size, type, azimuth, as well as floor, wall and object color. Note that we only considered the first three semantics for evaluation, as the standard augmentations in SSL contaminate any color-related semantics.
Settings. We adopted 6 representative disentanglement metrics: the Disentanglement Metric for Informativeness (DCI) [30], Interventional Robustness Score (IRS) [81], Explicitness Score (EXP) [74], Modularity Score (MOD) [74], and the accuracy of predicting the ground-truth semantic labels by two classification models, logistic regression (LR) and gradient boosted trees (GBT) [61]. Specifically, DCI and EXP measure the explicitness, i.e., whether the values of semantics can be decoded from the feature using a linear transformation. MOD and IRS measure the modularity, i.e., whether each feature dimension is equivariant to the shift of a single semantic. See Appendix for the detailed formulas of the metrics. In evaluation, we trained CNN-based feature extractor backbones with a comparable number of parameters for all the baselines and our IP-IRM. The full implementation details are in Appendix.

Table 1 (mean ± std; IRS is not reported on CMNIST):

Method  DCI  IRS  MOD  EXP  LR  GBT  Average
CMNIST
VAE [53]  0.948±0.004  -  0.664±0.121  0.968±0.007  0.824±0.019  0.948±0.004  0.849±0.057
β-VAE [43]  0.945±0.002  -  0.705±0.073  0.963±0.006  0.809±0.013  0.945±0.003  0.874±0.015
AnnealVAE [10]  0.911±0.002  -  0.790±0.075  0.965±0.007  0.821±0.022  0.911±0.002  0.880±0.016
β-TCVAE [16]  0.914±0.008  -  0.864±0.095  0.962±0.010  0.801±0.024  0.914±0.008  0.891±0.014
FactorVAE [52]  0.916±0.004  -  0.893±0.056  0.947±0.011  0.770±0.025  0.916±0.005  0.888±0.014
SimCLR [17]  0.882±0.019  -  0.767±0.025  0.976±0.011  0.863±0.036  0.876±0.015  0.873±0.016
IP-IRM (Ours)  0.917±0.008  -  0.785±0.031  0.990±0.002  0.921±0.009  0.916±0.007  0.906±0.011
Shapes3D
VAE [53]  0.351±0.026  0.284±0.009  0.820±0.015  0.802±0.054  0.421±0.079  0.352±0.027  0.505±0.028
β-VAE [43]  0.369±0.021  0.283±0.012  0.782±0.034  0.807±0.018  0.427±0.025  0.368±0.023  0.506±0.011
AnnealVAE [10]  0.327±0.069  0.412±0.049  0.743±0.070  0.643±0.013  0.259±0.021  0.328±0.070  0.452±0.023
β-TCVAE [16]  0.470±0.035  0.291±0.023  0.777±0.031  0.821±0.054  0.439±0.084  0.469±0.034  0.545±0.032
FactorVAE [52]  0.340±0.021  0.316±0.016  0.815±0.041  0.738±0.043  0.319±0.045  0.339±0.021  0.478±0.020
SimCLR [17]  0.535±0.016  0.439±0.030  0.678±0.050  0.949±0.005  0.733±0.055  0.536±0.015  0.645±0.026
IP-IRM (Ours)  0.565±0.023  0.420±0.014  0.766±0.036  0.959±0.007  0.757±0.025  0.565±0.023  0.672±0.017
Results. In Table 1, we compare the proposed IP-IRM to the standard SSL method SimCLR [17] as well as several generative disentanglement methods [53, 43, 10, 16, 52]. On both CMNIST and Shapes3D, IP-IRM outperforms SimCLR on all metrics except IRS, with the largest relative gain of 8.8% on MOD. For MOD, we notice that VAE performs better than our IP-IRM by 6 points, i.e., 0.82 vs. 0.76 on Shapes3D. This is because VAE explicitly pursues a high modularity score by regularizing the dimension-wise independence in the feature space. However, this regularization is adversarial to discriminative objectives [15, 97]. Indeed, we can observe from the LR column (i.e., the performance of downstream linear classification) that VAE methods perform clearly worse, especially on the more challenging Shapes3D. We can draw the same conclusion from the GBT results. Different from VAE methods, our IP-IRM is optimized towards disentanglement without such regularization, and is thus able to outperform the others in downstream tasks while obtaining a competitive value of modularity.
What do IP-IRM features look like? Figure 4 visualizes the features learned by SimCLR and our IP-IRM on two datasets: CMNIST in Figure 4 (a) and STL10 in Figure 4 (b). In the following, we use Figure 4 (a) as the example; similar conclusions can easily be drawn from Figure 4 (b). On the left-hand side of Figure 4 (a), it is obvious that there is no clear boundary distinguishing the color semantic in the SimCLR feature space. Besides, the features of the same digit semantic are scattered in two regions. On the right-hand side of (a), we have 3 observations for IP-IRM. 1) The features are well clustered and each cluster corresponds to a specific semantic of either digit or color. This validates the equivariant property of the IP-IRM representation: it responds to any change of the existing semantics, e.g., digit and color on this dataset. 2) The feature space has a symmetrical structure for each individual semantic, validating the decomposable property of the IP-IRM representation. More specifically, i) mirroring a feature (w.r.t. "*" in the figure center) indicates a change of the color semantic only, regardless of the other semantic (digit); and ii) a counter-clockwise rotation (denoted by black arrows from a same-colored 1 to 7) indicates a change of the digit semantic only. 3) IP-IRM reveals the true distribution (similarity) of different classes. For example, digits 3, 5, 8, which share sub-parts (curved bottoms and turnings), have closer feature points in the IP-IRM feature space.
How does IP-IRM disentangle features? 1) Discovered partitions: to visualize the discovered partitions at each maximization step, we performed an experiment on a binary CMNIST (digits 0 and 1, in red and green), and show the results in Figure 5 (a). Please kindly refer to Appendix for the full results on CMNIST. First, each partition tells apart a specific semantic into two subsets, e.g., in Partition #1, red and green digits are separated. Second, besides the obvious semantics of digit and color (labelled on the dataset), we can discover new semantics, e.g., the digit slant shown in Partition #3. 2) Disentangled representation: in Figure 5 (b), we visualize how equivariant each feature dimension is to the change of each semantic, i.e., a darker color shows that a dimension is more equivariant w.r.t. the semantic indicated on the left. We can see that SimCLR fails to learn a decomposable representation, e.g., the 8th dimension captures azimuth, type and size in Shapes3D. In contrast, our IP-IRM achieves disentanglement by representing the semantics in interpretable dimensions, e.g., the 6th and 7th dimensions capture the size, the 4th the type, and the 2nd and 9th the azimuth on Shapes3D. Overall, the results support the justification in Section 4, i.e., we discover a new semantic at each iteration through the partition, and IP-IRM eventually converges to a disentangled representation.
5.2 Self-Supervised Learning
Datasets and Settings. We conducted the SSL evaluations on 2 standard benchmarks following [90, 21, 50]. Cifar100 [55] contains 60,000 images in 100 classes and STL10 [23] has 113,000 images in 10 classes. We used SimCLR [17], DCL [21] and HCL [50] as baselines, and learned the representations for 400 and 1,000 epochs. We evaluated both linear and NN classification accuracies for the downstream classification task. Implementation details are in Appendix.
Method  STL10 NN  STL10 Linear  Cifar100 NN  Cifar100 Linear
400 epoch training  
SimCLR [17]  73.60  78.89  54.94  66.63 
DCL [21]  78.82  82.56  57.29  68.59 
HCL [50]  80.06  87.60  59.61  69.22 
SimCLR+IP-IRM  79.66  84.44  59.10  69.55
DCL+IP-IRM  81.51  85.36  58.37  68.76
HCL+IP-IRM  84.29  87.81  60.05  69.95
1,000 epoch training  
SimCLR [17]  78.60  84.24  59.45  68.73 
SimCLR [57]  79.80  85.56  63.67  72.18 
SimCLR+IP-IRM  85.08  89.91  65.82  73.99
Supervised        73.72 
Supervised+MixUp [100]        74.19 
Results. We report our results and compare with baselines in Table 2. Incorporating IP-IRM into the 3 baselines brings consistent performance boosts to downstream classification models in all settings, e.g., improving the linear models by up to 5.55% on STL10 and 2.92% on Cifar100. In particular, we observe that IP-IRM brings a large performance gain with NN classifiers, e.g., 4.23% using HCL+IP-IRM on STL10, i.e., the distance metric in the IP-IRM feature space more faithfully reflects the class semantic differences. This validates that our algorithm further disentangles the representation compared to standard SSL. Moreover, by extending the training process to 1,000 epochs with MixUp [57], SimCLR+IP-IRM achieves a further performance boost on both datasets, e.g., 5.28% for the NN and 4.35% for the linear classifier over the SimCLR baseline on STL10. Notably, our SimCLR+IP-IRM surpasses vanilla supervised learning on Cifar100 under the same evaluation setting. Still, the quality of disentanglement cannot be fully evaluated when the training and test samples are identically distributed: while the improved accuracy demonstrates that the IP-IRM representation is more equivariant to class semantics, it does not reveal whether the representation is decomposable. Hence we present an out-of-distribution (OOD) setting in Section 5.3 to further show this property.
Is IP-IRM sensitive to the values of hyper-parameters? 1) $\lambda_1$ and $\lambda_2$ in Eq. (2) and Eq. (3). In Figure 6 (a), we observe that the best performance is achieved with $\lambda_1$ and $\lambda_2$ taking moderate values on both datasets, and all accuracies drop sharply when they become too large. The reason is that a higher $\lambda$ forces the model to push the induced similarity towards the fixed baseline $\theta = 1$, rather than decrease the loss on the pretext task, leading to poor convergence. 2) The number of epochs. In Figure 6 (b), we plot the Top-1 accuracies of NN classifiers along the 700-epoch training of two kinds of SSL representations: SimCLR and IP-IRM. It is obvious that IP-IRM converges faster and achieves a higher accuracy than SimCLR. It is worth highlighting that on STL10, the accuracy of SimCLR starts to oscillate and grow slowly after the 150th epoch, while ours keeps improving. This is empirical evidence that IP-IRM keeps disentangling more and more semantics in the feature space, and has the potential to improve through long-term training.
5.3 Potential on Large-Scale Data
Datasets. We evaluated on the standard supervised learning benchmark ImageNet ILSVRC2012 [26], which has in total 1,331,167 images in 1,000 classes. To further reveal whether a representation is decomposable, we used NICO [39], a real-world image dataset designed for OOD evaluations. It contains 25,000 images in 19 classes, with a strong correlation between the foreground and background in the train split (e.g., most dogs on grass). We also studied the transferability of the learned representation following [31, 54]: FGVC Aircraft (Aircraft) [62], Caltech-101 (Caltech) [33], Stanford Cars (Cars) [95], Cifar10 [56], Cifar100 [56], DTD [22], Oxford 102 Flowers (Flowers) [64], Food101 (Food) [6], Oxford-IIIT Pets (Pets) [69] and SUN397 (SUN) [93]. These datasets range from coarse- to fine-grained classification tasks, and vary in the amount of training data (2,000 to 75,000 images) and classes (10 to 397 classes), representing a wide range of transfer learning settings.
Settings. For ImageNet, all the representations were trained for 200 epochs due to limited computing resources. We followed the common setting [82, 38], using a linear classifier, and report Top-1 classification accuracies. For NICO, we fixed the ImageNet-pretrained ResNet-50 backbone and fine-tuned the classifier. See Appendix for more training details. For the transfer learning, we followed [31, 54] to report the classification accuracies on Cars, Cifar10, Cifar100, DTD, Food, SUN and the average per-class accuracies on Aircraft, Caltech, Flowers, Pets; we call them uniformly Accuracy. We used the few-shot n-way-k-shot setting for model evaluation. Specifically, we randomly sampled 2,000 episodes from the test splits of the above datasets. An episode contains n classes, each with k training samples and 15 testing samples; we fine-tuned the linear classifier (backbone weights frozen) for 100 epochs on the training samples, and evaluated it on the testing samples. We evaluated with the 5-way-5-shot setting (results of other settings in Appendix).

Method  Aircraft  Caltech  Cars  Cifar10  Cifar100  DTD  Flowers  Food  Pets  SUN  Average
InsDis [92]  35.07  75.97  37.49  51.49  57.61  69.38  77.35  50.01  66.38  74.97  59.57 
PCL [58]  36.86  90.72  39.68  59.26  60.78  69.53  67.50  57.06  88.31  84.51  65.42 
PIRL [63]  36.70  78.63  39.21  49.85  55.23  70.43  78.37  51.61  69.40  76.64  60.61 
MoCo-v1 [38]  35.31  79.60  36.35  46.96  51.62  68.76  75.42  49.77  68.32  74.77  58.69
MoCo-v2 [19]  31.98  92.32  41.47  56.50  63.33  78.00  80.05  57.25  83.23  88.10  67.22
IP-IRM (Ours)  32.98  93.16  42.87  60.73  68.54  79.30  82.68  59.61  85.23  89.38  69.44
ImageNet and NICO. In Table 3 (ImageNet accuracy), our IP-IRM achieves the best performance over all baseline models. Yet we believe that this does not show the full potential of IP-IRM, because ImageNet is a large-scale dataset with many semantics, and it is hard to achieve a full disentanglement of all of them within the limited 200 epochs. To evaluate the feature decomposability of IP-IRM, we compared the performance on NICO with various SSL baselines in Table 3, where our approach significantly outperforms the baselines by 1.5-4.2%. This validates that the IP-IRM feature is more decomposable: if each semantic feature (e.g., background) is decomposed into some fixed dimensions and some classes vary with that semantic, then the classifier will recognize it as a non-discriminative variant feature and hence focus on other, more discriminative features (i.e., foreground). In this way, even though some classes are confounded by those non-discriminative features (e.g., most of the "dog" images have a "grass" background), the fixed dimensions still help the classifier neglect the non-discriminative ones. We further visualized the CAM [101] on NICO in Figure 7, which indeed shows that IP-IRM helps the classifier focus on the foreground regions.
Few-Shot Tasks. As shown in Table 4, our IP-IRM significantly improves the performance in the 5-way-5-shot setting, e.g., we outperform the baseline MoCo-v2 by 2.2%. This is because IP-IRM further disentangles the representation over standard SSL, which is essential for representations to generalize to different downstream class domains (recall Corollary 2 of Lemma 1). This is also in line with recent works [88] showing that a disentangled representation is especially beneficial in low-shot scenarios, and further demonstrates the importance of disentanglement in downstream tasks.
6 Conclusion
We presented an unsupervised disentangled representation learning method, Iterative Partition-based Invariant Risk Minimization (IP-IRM), based on Self-Supervised Learning (SSL). IP-IRM iteratively partitions the dataset into semantic-related subsets, and learns a representation invariant across the subsets using SSL with an IRM loss. We show, with a theoretical guarantee, that IP-IRM converges to a disentangled representation under the group-theoretic view, which fundamentally surpasses the capabilities of existing SSL and fully-supervised learning. Our proposed theory is backed by strong empirical results in disentanglement metrics, SSL classification accuracy and transfer performance. IP-IRM achieves disentanglement without using generative models, making it widely applicable on large-scale visual tasks. As future directions, we will continue to explore the application of group theory in representation learning and seek additional forms of inductive bias for faster convergence.
Acknowledgments and Disclosure of Funding
The authors would like to thank all reviewers for their constructive suggestions. This research is partly supported by the Alibaba-NTU Joint Research Institute, the A*STAR under its AME YIRG Grant (Project No. A20E6c0101), and the Singapore Ministry of Education (MOE) Academic Research Fund (AcRF) Tier 2 grant.
References
 [1] (1972) More is different. Science. Cited by: §1.
 [2] (2019) Invariant risk minimization. arXiv preprint arXiv:1907.02893. Cited by: §C.3.1, §1, §3, Figure 4, §5.1, Table 1.
 [3] (2019) Learning representations by maximizing mutual information across views. arXiv preprint arXiv:1906.00910. Cited by: §2.
 [4] (2013) Representation learning: a review and new perspectives. IEEE transactions on pattern analysis and machine intelligence 35 (8), pp. 1798–1828. Cited by: §1.
 [5] (2009) Learning deep architectures for ai. Now Publishers Inc. Cited by: §2.

 [6] (2014) Food-101 – mining discriminative components with random forests. In European conference on computer vision, Cited by: §5.3.
 [7] (2001) Random forests. Machine learning 45 (1), pp. 5–32. Cited by: §C.2.1.
 [8] (1993) Signature verification using a "siamese" time delay neural network. Advances in neural information processing systems 6, pp. 737–744. Cited by: §2.
 [9] (2020) Language models are few-shot learners. In Advances in Neural Information Processing Systems, Cited by: §1.
 [10] (2018) Understanding disentangling in vae. arXiv preprint arXiv:1804.03599. Cited by: Table 12, §5.1, Table 1.
 [11] (2019) Learning disentangled semantic representation for domain adaptation. In IJCAI: proceedings of the conference, Cited by: §2.

 [12] (2018) Deep clustering for unsupervised learning of visual features. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 132–149. Cited by: §2.
 [13] (2019) Unsupervised pre-training of image features on non-curated data. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2959–2968. Cited by: §2.
 [14] (2020) Unsupervised learning of visual features by contrasting cluster assignments. arXiv preprint arXiv:2006.09882. Cited by: §2.
 [15] (2018) Zero-shot visual recognition using semantics-preserving adversarial embedding networks. In CVPR, Cited by: §5.1.

 [16] (2018) Isolating sources of disentanglement in variational autoencoders. In Advances in neural information processing systems, Cited by: Table 12, §2, §5.1, Table 1.
 [17] (2020) A simple framework for contrastive learning of visual representations. In International conference on machine learning, pp. 1597–1607. Cited by: Table 11, Table 12, Figure 2, §1, §1, §2, Figure 4, Figure 7, §5.1, §5.2, Table 1, Table 2.
 [18] (2016) Infogan: interpretable representation learning by information maximizing generative adversarial nets. In Advances in neural information processing systems, Cited by: §2.
 [19] (2020) Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297. Cited by: Table 13, Table 14, Table 15, §2, Figure 7, Figure 7, Table 4.
 [20] (2020) Exploring simple siamese representation learning. arXiv preprint arXiv:2011.10566. Cited by: §C.3.1, §2, Figure 7.
 [21] (2020) Debiased contrastive learning. arXiv preprint arXiv:2007.00224. Cited by: §C.3.1, Table 11, §2, §5.2, Table 2.

 [22] (2014) Describing textures in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §5.3.
 [23] (2011) An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pp. 215–223. Cited by: Figure 2, Figure 4, §5.2, Table 2.
 [24] (2018) Spherical cnns. In ICLR, Cited by: §2.
 [25] (2014) Learning the irreducible representations of commutative Lie groups. In International Conference on Machine Learning, Cited by: §2.
 [26] (2009) Imagenet: a large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. Cited by: §C.5, Table 13, Table 14, Table 15, §5.3, Table 4.
 [27] (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Cited by: §1.

 [28] (2020) Theory and evaluation metrics for learning disentangled representations. In International conference on learning representations, Cited by: §2.
 [29] (2015) Unsupervised visual representation learning by context prediction. In Proceedings of the IEEE international conference on computer vision, pp. 1422–1430. Cited by: §2.
 [30] (2018) A framework for the quantitative evaluation of disentangled representations. In International conference on learning representations, Cited by: §C.2.1, §2, §5.1.
 [31] (2021) How Well Do SelfSupervised Models Transfer?. In CVPR, Cited by: §C.5.1, §C.5, Table 13, §5.3, §5.3.
 [32] (2010) The pascal visual object classes (voc) challenge. International journal of computer vision. Cited by: Table 13.
 [33] (2004) Learning generative visual models from few training examples: an incremental bayesian approach tested on 101 object categories. In 2004 conference on computer vision and pattern recognition workshop, Cited by: §5.3.
 [34] (2016) Domain-adversarial training of neural networks. The Journal of Machine Learning Research. Cited by: §1.
 [35] (2018) Unsupervised representation learning by predicting image rotations. arXiv preprint arXiv:1803.07728. Cited by: §2.
 [36] (2020) Bootstrap your own latent: a new approach to self-supervised learning. arXiv preprint arXiv:2006.07733. Cited by: §2.
 [37] (1991) Representation theory: a first course. External Links: ISBN 9780387974958 Cited by: §1, §2.
 [38] (2019) Momentum contrast for unsupervised visual representation learning. arXiv preprint arXiv:1911.05722. Cited by: Table 13, Table 14, Table 15, §1, §2, Figure 7, §5.3, Table 4.
 [39] (2021) Towards non-IID image classification: a dataset and baselines. Pattern Recognition 110, pp. 107383. Cited by: §C.4.1, Table 10, Figure 7, §5.3.
 [40] (2020) Data-efficient image recognition with contrastive predictive coding. In International Conference on Machine Learning, pp. 4182–4192. Cited by: §2.
 [41] (2017) Beta-VAE: learning basic visual concepts with a constrained variational framework. In ICLR, Cited by: §2.
 [42] (2018) Towards a definition of disentangled representations. arXiv preprint arXiv:1812.02230. Cited by: Self-Supervised Learning Disentangled Group Representation as Feature, §1, §2.
 [43] (2017) Beta-VAE: learning basic visual concepts with a constrained variational framework. International conference on learning representations. Cited by: Table 12, §5.1, Table 1.

 [44] (2018) Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670. Cited by: §2.
 [45] (2018) Learning to decompose and disentangle representations for video prediction. In Advances in neural information processing systems, Cited by: §2.
 [46] (2020) AdCo: adversarial contrast for efficient learning of unsupervised representations from self-trained negative adversaries. arXiv preprint arXiv:2011.08435. Cited by: §2.
 [47] (2019) Adversarial examples are not bugs, they are features. In Advances in Neural Information Processing Systems, Cited by: §1.
 [48] (2018) Understanding image motion with group representations. Cited by: §2.
 [49] (1994) Abstract algebra: theory and applications (the prindle, weber & schmidt series in advanced mathematics). Prindle Weber & Schmidt. Cited by: §2, §4, footnote 1.
 [50] (2020) Hard negative mixing for contrastive learning. arXiv preprint arXiv:2010.01028. Cited by: §C.3.1, Table 11, §2, §5.2, Table 2.
 [51] (2015) Bayesian representation learning with oracle constraints. International conference on learning representations. Cited by: §2.
 [52] (2018) Disentangling by factorising. In International Conference on Machine Learning, Cited by: Table 12, §2, §5.1, §5.1, Table 1.
 [53] (2014) Auto-encoding variational Bayes. In ICLR, Cited by: Table 12, §5.1, Table 1.
 [54] (2019) Do better imagenet models transfer better?. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Cited by: §5.3, §5.3.
 [55] (2009) Learning multiple layers of features from tiny images. Cited by: §5.2, Table 2.
 [56] (2012) Learning multiple layers of features from tiny images. Cited by: §5.3.
 [57] (2021) i-Mix: a domain-agnostic strategy for contrastive representation learning. In ICLR, Cited by: §C.3.1, §5.2, Table 2.
 [58] (2020) Prototypical contrastive learning of unsupervised representations. arXiv preprint arXiv:2005.04966. Cited by: Table 13, Table 14, Table 15, §2, Figure 7, Table 4.
 [59] (2020) InfoGAN-CR and ModelCentrality: self-supervised model training and selection for disentangling GANs. In International Conference on Machine Learning, Cited by: §2.
 [60] (2020) Disentangling factors of variations using few labels. In 8th International Conference on Learning Representations (ICLR), Cited by: §2.
 [61] (2019) Challenging common assumptions in the unsupervised learning of disentangled representations. In international conference on machine learning, Cited by: §C.2.1, §C.2.1, §2, §5.1.
 [62] (2013) Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151. Cited by: §5.3.
 [63] (2020) Self-supervised learning of pretext-invariant representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6707–6717. Cited by: Table 13, Table 14, Table 15, §2, Figure 7, Table 4.
 [64] (2008) Automated flower classification over a large number of classes. In 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing, Cited by: §5.3.
 [65] (2016) Unsupervised learning of visual representations by solving jigsaw puzzles. In European conference on computer vision, pp. 69–84. Cited by: §2.
 [66] (2020) Elastic-InfoGAN: unsupervised disentangled representation learning in class-imbalanced data. In Advances in neural information processing systems, Cited by: §2.
 [67] (2018) Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748. Cited by: §2.
 [68] (2018) Learning independent causal mechanisms. In Proceedings of the 35th International Conference on Machine Learning, pp. 4036–4044. Cited by: §1.
 [69] (2012) Cats and dogs. In 2012 IEEE conference on computer vision and pattern recognition, Cited by: §5.3.
 [70] (2020) Learning disentangled representations and group structure of dynamical environments. Advances in Neural Information Processing Systems. Cited by: §2.
 [71] (1999) Learning lie groups for invariant visual perception. Advances in neural information processing systems. Cited by: §1.
 [72] (2014) Learning to disentangle factors of variation with manifold interaction. In International conference on machine learning, Cited by: §2.
 [73] (2021) Do generative models know disentanglement? contrastive learning is all you need. arXiv preprint arXiv:2102.10543. Cited by: §2.
 [74] (2018) Learning deep disentangled embeddings with the f-statistic loss. In Advances in neural information processing systems, Cited by: §C.2.1, §2, §5.1.
 [75] (2020) Contrastive learning with hard negative samples. arXiv preprint arXiv:2010.04592. Cited by: §2.
 [76] (2012) On causal and anticausal learning. In Proceedings of the 29th International Conference on Machine Learning, Cited by: §1.
 [77] (2015) Very deep convolutional networks for large-scale image recognition. In 3rd International Conference on Learning Representations, Cited by: §C.1, Figure 2.
 [78] (2020) Scale-equivariant steerable networks. Cited by: §2.
 [79] (2014) Striving for simplicity: the all convolutional net. arXiv preprint arXiv:1412.6806. Cited by: §C.1, Figure 2.
 [80] (2018) Interventional robustness of deep latent variable models. arXiv. Cited by: §2.
 [81] (2019) Robustly disentangled causal mechanisms: validating deep representations for interventional robustness. In International Conference on Machine Learning, Cited by: §C.2.1, §2, §5.1.
 [82] (2019) Contrastive multiview coding. arXiv preprint arXiv:1906.05849. Cited by: §C.3.1, Table 11, §2, §5.3.
 [83] (2020) Contrastive multiview coding. In European conference on computer vision, A. Vedaldi, H. Bischof, T. Brox, and J. Frahm (Eds.), Cited by: Figure 2.
 [84] (2020) What makes for good views for contrastive learning. arXiv preprint arXiv:2005.10243. Cited by: §2.
 [85] (2021) Understanding self-supervised learning dynamics without contrastive pairs. arXiv preprint arXiv:2102.06810. Cited by: §2.

 [86] (2017) Disentangled representation learning GAN for pose-invariant face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
 [87] (2008) Visualizing data using t-SNE. Journal of machine learning research 9 (11). Cited by: Figure 4.
 [88] (2019) Are disentangled representations helpful for abstract visual reasoning?. In Advances in neural information processing systems, Cited by: §D.4, §2, §5.3.
 [89] (2021) Causal attention for unbiased visual recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Cited by: §1.
 [90] (2020) Unsupervised feature learning by cross-level discrimination between instances and groups. arXiv preprint arXiv:2008.03813. Cited by: §2, §5.2.
 [91] (2019) Deep scale-spaces: equivariance over scale. In NeurIPS, Cited by: §2.
 [92] (2018) Unsupervised feature learning via non-parametric instance discrimination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3733–3742. Cited by: Table 13, Table 14, Table 15, §2, Figure 7, Table 4.

 [93] (2010) SUN database: large-scale scene recognition from abbey to zoo. In 2010 IEEE computer society conference on computer vision and pattern recognition, Cited by: §5.3.
 [94] (2021) What should not be contrastive in contrastive learning. In International Conference on Learning Representations, Cited by: §4.
 [95] (2015) A large-scale car dataset for fine-grained categorization and verification. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §5.3.
 [96] (2019) Unsupervised embedding learning via invariant and spreading instance feature. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6210–6219. Cited by: §2.
 [97] (2021) Counterfactual zero-shot and open-set visual recognition. In CVPR, Cited by: §1, §5.1.
 [98] (2020) Interventional few-shot learning. In NeurIPS, Cited by: §D.5, item 4.
 [99] (2020) Measuring disentanglement: a review of metrics. arXiv preprint arXiv:2012.09276. Cited by: §C.2.1.
 [100] (2018) Mixup: beyond empirical risk minimization. In ICLR, Cited by: Table 2.

 [101] (2016) Learning deep features for discriminative localization. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2921–2929. Cited by: Figure 7, §5.3.
 [102] (2020) S3VAE: self-supervised sequential VAE for representation disentanglement and data generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Cited by: §2.

 [103] (2014) Multi-view perceptron: a deep model for learning face identity and view representations. Advances in Neural Information Processing Systems 27 (NIPS 2014). Cited by: §2.
Appendix A Preliminaries
A group is a set together with a binary operation that takes two elements of the set and maps them to another element of the set. For example, the set of integers is a group under addition. We formalize this notion through the following definitions.
Binary Operation. A binary operation $*$ on a set $S$ is a function mapping $S \times S$ into $S$. For each $(a, b) \in S \times S$, we denote the element $*((a, b))$ by $a * b$.
Group. A group is a set $G$, closed under a binary operation $*$, such that the following axioms hold:

Associativity. $\forall a, b, c \in G$, we have $(a * b) * c = a * (b * c)$.

Identity Element. $\exists e \in G$, such that $\forall a \in G$, $e * a = a$ and $a * e = a$.

Inverse. $\forall a \in G$, $\exists a' \in G$, such that $a * a' = a' * a = e$.
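As a concrete illustration (ours, not from the paper), the closure, associativity, identity and inverse axioms can be checked mechanically for a small group such as the integers modulo 5 under addition; the helper name `check_group` is hypothetical:

```python
# Illustration (ours): mechanically check the group axioms for
# Z_5 = {0, ..., 4} under addition modulo 5.

def check_group(elements, op):
    """True iff (elements, op) satisfies closure, associativity,
    a unique identity, and inverses for every element."""
    if not all(op(a, b) in elements for a in elements for b in elements):
        return False  # not closed
    if not all(op(op(a, b), c) == op(a, op(b, c))
               for a in elements for b in elements for c in elements):
        return False  # not associative
    identities = [e for e in elements
                  if all(op(e, a) == a and op(a, e) == a for a in elements)]
    if len(identities) != 1:
        return False  # no identity element
    e = identities[0]
    return all(any(op(a, b) == e and op(b, a) == e for b in elements)
               for a in elements)  # every element has an inverse

Z5 = set(range(5))
add_mod5 = lambda a, b: (a + b) % 5
print(check_group(Z5, add_mod5))  # True
```

The same checker reports False for, e.g., {1, 2, 3} under multiplication mod 5, which is not closed.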
Groups often arise as transformations of some space, such as a set, vector space, or topological space. Consider an equilateral triangle. The clockwise rotations about its centroid that retain its appearance form a group $\{r_{120}, r_{240}, r_{360}\}$, with the last element corresponding to an identity mapping. We say this group of rotations acts on the triangle, which is formally defined below.
Group Action. Let $G$ be a group with binary operation $*$ and let $X$ be a set. An action of $G$ on $X$ is a map $\alpha: G \times X \to X$ such that $\alpha(e, \cdot)$ is the identity map on $X$ and $\alpha(g_1, \cdot) \circ \alpha(g_2, \cdot) = \alpha(g_1 * g_2, \cdot)$ for all $g_1, g_2 \in G$, where $\circ$ denotes functional composition. For brevity, $\forall g \in G, x \in X$, we denote $\alpha(g, x)$ as $g \cdot x$.
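The two action axioms can be verified numerically in a toy setting (our illustration, not the paper's): the rotation group of the triangle, encoded as $\mathbb{Z}_3$, acting on the vertex set $\{0, 1, 2\}$ by cyclic shift:

```python
# A sketch (ours, not from the paper): the rotation group Z_3 acting on
# the vertex set {0, 1, 2} of an equilateral triangle by cyclic shift,
# i.e., g . x = (x + g) mod 3, with identity element g = 0.

def act(g, x):
    return (x + g) % 3

G = range(3)  # rotations by 0, 120 and 240 degrees
X = range(3)  # triangle vertices

# Axiom 1: the identity element acts as the identity map on X.
assert all(act(0, x) == x for x in X)

# Axiom 2: acting by g2 then g1 equals acting by the composition g1 * g2.
assert all(act(g1, act(g2, x)) == act((g1 + g2) % 3, x)
           for g1 in G for g2 in G for x in X)
print("both group-action axioms hold")
```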
In our formulation, we have a group $G$ acting on the semantic space $\mathcal{A}$. For example, consider the color semantic, which can be mapped to a circle representing the hue. The group acting on it then corresponds to rotations, similar to the triangle example; e.g., a group element may correspond to rotating a color clockwise by a fixed angle. In the context of representation learning, we are interested in learning a feature space that reflects this group action, as formally defined below.
Group Representation. Let $G$ be a group. A representation of $G$ (or a $G$-representation) is a pair $(V, \rho)$, where $V$ is a vector space and $\rho$ is a group action of $G$ on $V$ such that for each $g \in G$, $\rho(g): V \to V$ is a linear map.
Intuitively, each $\rho(g)$ corresponds to a linear map, i.e., a matrix that transforms a vector $v \in V$ to $\rho(g)\,v \in V$. Finally, our definition of disentangled representation involves a decomposition of the semantic space and of the group acting on it. The decomposition of the semantic space is based on the Cartesian product, i.e., $\mathcal{A} = \mathcal{A}_1 \times \cdots \times \mathcal{A}_m$. A similar concept is defined w.r.t. the group.
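As an illustrative sketch (ours, not from the paper), the cyclic group of order 4 can be represented on $\mathbb{R}^2$ by powers of the 90° rotation matrix: each group element becomes a linear map, and composition of group elements matches matrix multiplication:

```python
# Illustrative sketch (ours): a representation of the cyclic group Z_4 on
# the vector space R^2, where element k is represented by the linear map
# "rotate by k * 90 degrees", i.e., the k-th power of a rotation matrix.

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

I2  = [[1, 0], [0, 1]]    # rho(0): identity map
R90 = [[0, -1], [1, 0]]   # rho(1): rotation by 90 degrees

def rho(k):
    """The matrix representing group element k of Z_4."""
    M = I2
    for _ in range(k % 4):
        M = matmul(R90, M)
    return M

# Homomorphism property: rho(g1 + g2) = rho(g1) rho(g2).
assert all(rho((g1 + g2) % 4) == matmul(rho(g1), rho(g2))
           for g1 in range(4) for g2 in range(4))
print("rho is a group representation of Z_4 on R^2")
```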
Direct Product of Groups. Let $G_1, \ldots, G_n$ be groups with binary operations $*_1, \ldots, *_n$, respectively. Let $g_i, g_i' \in G_i$ for $i = 1, \ldots, n$. Define $(g_1, \ldots, g_n) * (g_1', \ldots, g_n')$ to be the element $(g_1 *_1 g_1', \ldots, g_n *_n g_n')$. Then $G_1 \times \cdots \times G_n$ is the direct product of the groups under the binary operation $*$.
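A minimal sketch (ours, not from the paper): the direct product $\mathbb{Z}_2 \times \mathbb{Z}_3$ under the component-wise operation is again a group, here checked for closure, identity and inverses:

```python
# Sketch (ours): the direct product Z_2 x Z_3 with the component-wise
# binary operation, which is again a group, of order 2 * 3 = 6.

def direct_product_op(g, h):
    """(g1, g2) * (h1, h2) = (g1 +_2 h1, g2 +_3 h2)."""
    return ((g[0] + h[0]) % 2, (g[1] + h[1]) % 3)

G = [(a, b) for a in range(2) for b in range(3)]

# Closure and identity element (0, 0).
assert all(direct_product_op(g, h) in G for g in G for h in G)
assert all(direct_product_op((0, 0), g) == g for g in G)

# Component-wise inverses: (-g1 mod 2, -g2 mod 3).
assert all(direct_product_op(g, ((-g[0]) % 2, (-g[1]) % 3)) == (0, 0)
           for g in G)
print(len(G))  # 6
```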
With this, we can formally define what it means for a feature subspace $V_i$ to be only affected by the action of $G_i$ and fixed by the actions of the other subgroups: $V_i$ is a trivial subrepresentation w.r.t. $\prod_{j \neq i} G_j$ ("fixed"), i.e., for each $g \in \prod_{j \neq i} G_j$, the restriction of $\rho(g)$ to $V_i$ is the identity mapping, and $V_i$ is a non-trivial subrepresentation w.r.t. $G_i$ ("affected").
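The "fixed vs. affected" condition can be illustrated with a toy example (ours, not from the paper): $G = \mathbb{Z}_2 \times \mathbb{Z}_2$ acting on $\mathbb{R}^2$ by independent sign flips, so each feature dimension is a trivial subrepresentation of the other factor:

```python
# Toy example (ours): a disentangled representation of G = Z_2 x Z_2 on
# V = R^2, where the i-th factor acts on dimension i by a sign flip.
# Each dimension is then fixed by the action of the other factor.

def rho(g, v):
    g1, g2 = g
    return ((-1) ** g1 * v[0], (-1) ** g2 * v[1])

v = (0.7, -0.4)

# Dimension 0 is a trivial subrepresentation of {e} x Z_2 ("fixed") ...
assert all(rho((0, g2), v)[0] == v[0] for g2 in range(2))
# ... and a non-trivial subrepresentation of Z_2 x {e} ("affected").
assert rho((1, 0), v)[0] != v[0]
print("dimension 0 is fixed by G_2 and affected by G_1")
```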
Appendix B Proof
B.1 Proof of Definition 2
The Group Orbits Define a Partition of $\mathcal{X}$. We will show that the orbits of a group $G$ acting on the sample set $\mathcal{X}$ define an equivalence relation $\sim$ on $\mathcal{X}$, which naturally leads to a partition of $\mathcal{X}$. For $x_1, x_2 \in \mathcal{X}$, let $x_1 \sim x_2$ if and only if $\exists g \in G$ such that $x_1 = g \cdot x_2$. We show that $\sim$ satisfies the three properties of an equivalence relation. 1) Reflexive: $\forall x \in \mathcal{X}$, we have $x = e \cdot x$, hence $x \sim x$. 2) Symmetric: suppose $x_1 \sim x_2$, i.e., $x_1 = g \cdot x_2$ for some $g \in G$. Then $x_2 = g^{-1} \cdot x_1$, i.e., $x_2 \sim x_1$. 3) Transitive: if $x_1 \sim x_2$ and $x_2 \sim x_3$, then $x_1 = g \cdot x_2$ and $x_2 = g' \cdot x_3$ for some $g, g' \in G$. Hence $x_1 = (g * g') \cdot x_3$ and $x_1 \sim x_3$.
Number of Orbits. Recall that the group acts transitively on the sample set (see Section 4). We consider the non-trivial case where the action of the relevant subgroup is faithful, i.e., the only group element that maps every sample to itself is the identity element $e$. We will show that each non-identity element corresponds to a unique orbit. 1) Suppose, for contradiction, that the action of some non-identity element on each sample of an orbit corresponds to the identity mapping. One can show that for every different orbit, the action of this element on each sample is also the identity mapping. As the orbits partition the sample set, this means that the action of this element is the identity mapping on all samples, which contradicts the action being faithful. 2) The previous step shows that non-identity group elements lead to a different orbit. We further need to show that these orbits are unique, i.e., if two elements yield the same orbit, then they are equal. Suppose $g_1$ and $g_2$ yield the same orbit; then $g_2^{-1} * g_1$ lies in the point stabilizer of the orbit. As the action is faithful, the point stabilizer is trivial, hence $g_2^{-1} * g_1 = e$, i.e., $g_1 = g_2$.
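As a toy illustration of orbits partitioning a set (ours, not the paper's exact setting): the subgroup $D = \{0, 2, 4\}$ of $\mathbb{Z}_6$ acting by addition mod 6 yields exactly $|\mathbb{Z}_6|/|D| = 2$ orbits, which are disjoint and cover the set:

```python
# Toy example (ours): the subgroup D = {0, 2, 4} of Z_6 acts on
# X = {0, ..., 5} by addition mod 6. The orbits partition X, and because
# Z_6 acts transitively and faithfully, their number is |Z_6| / |D| = 2.

D = [0, 2, 4]
X = range(6)

def orbit(x):
    """The D-orbit of x, i.e., {d . x : d in D}."""
    return frozenset((x + d) % 6 for d in D)

orbits = {orbit(x) for x in X}
print(sorted(sorted(o) for o in orbits))  # [[0, 2, 4], [1, 3, 5]]

# Disjoint orbits covering X: an equivalence-class partition.
assert sum(len(o) for o in orbits) == len(set(X))
```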
B.2 Details of Lemma 1
We will first prove Lemma 1 by showing that the representation is equivariant, followed by showing the decomposability of the two feature subspaces, and finally showing that each subspace is not further decomposable. We will then present more details on the 4 corollaries.
Proof of Equivariance. Suppose that the training loss is minimized, yet two distinct samples $x_i \neq x_j$ share the same feature $z_i = z_j$, where $z = \phi(x)$. Consider the term $z_i^\top z_j$ in the denominator of the contrastive loss: we have $z_i^\top z_j = \|z_i\| \|z_j\| \cos\theta$, where $\theta$ is the angle between the two vectors. When $z_i = z_j$, $\theta = 0$ and $\cos\theta$ is maximal. So keeping $\|z_i\|$ and $\|z_j\|$ constant (i.e., the same regularization penalty such as L2), the denominator can be further reduced if $\theta > 0$, which reduces the training loss. This contradicts the earlier assumption. Hence by minimizing the training loss, we achieve sample-equivariance, i.e., different samples have different features. Note that this does not necessarily mean group-equivariance. However, the variation of training samples is all we know about the group action of $G$, and we have established that the action of $G$ is transitive on the training set, hence we use the sample-equivariant features as the approximation of equivariant features.
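A numeric sketch of this argument (ours; it assumes an InfoNCE loss with a single positive and a single negative key, temperature 1, and unit-norm features): collapsing two distinct samples onto the same feature inflates the denominator and hence the loss:

```python
import math

# Numeric sketch (ours): InfoNCE with one positive and one negative key,
# temperature t = 1, unit-norm features. Collapsing a distinct sample onto
# the anchor's feature inflates the denominator and hence the loss.

def info_nce(anchor, positive, negative, t=1.0):
    dot = lambda u, v: sum(a * b for a, b in zip(u, v))
    pos = math.exp(dot(anchor, positive) / t)
    neg = math.exp(dot(anchor, negative) / t)
    return -math.log(pos / (pos + neg))

# Distinct samples get distinct features (angle theta = 90 degrees) ...
loss_distinct = info_nce((1.0, 0.0), (1.0, 0.0), (0.0, 1.0))
# ... versus collapsed features: the negative shares the anchor's feature.
loss_collapsed = info_nce((1.0, 0.0), (1.0, 0.0), (1.0, 0.0))

assert loss_collapsed > loss_distinct  # collapse increases the training loss
```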
Proof of Decomposability. Recall the semantic representation, which was shown to be equivariant in the previous step. Consider a non-decomposable representation whose feature space $V$ is affected by the actions of both groups, denoted $G_1$ and $G_2$ here. Let $V = V_1 \oplus V_2$, where both subspaces are affected by the actions of the two groups, and write the feature of a sample as $z = [z_1, z_2]$ with $z_1 \in V_1$, $z_2 \in V_2$ (recall that the group action affects the semantics and hence the features through the equivariant map in Figure 1). From here, we will construct a representation where $V_2$ is only affected by the action of $G_2$, with a lower training loss.

Specifically, we aim to assign a vector $c_i$ to the $i$-th orbit, which is given by:

(4) $\quad \{c_1^*, \ldots, c_m^*\} = \arg\min_{\{c_1, \ldots, c_m\}} \; \mathbb{E}_{i \neq j}\!\left[c_i^\top c_j\right],$

where $c_i$ is the value of $z_2$ for the $i$-th orbit and $m$ is the number of orbits. Now define the new representation given by $z' = [z_1, c_i]$ for every sample in the $i$-th orbit. Using this new representation has two outcomes:

1) The similarity in the numerator is a linear combination of the dot similarities induced from $V_1$ and $V_2$, and the dot similarity induced from $V_2$ is increased, as inside each orbit the value of $z_2$ is the same $c_i$ (maximized similarity);

2) The denominator is now reduced. This is because the denominator is proportional to the expected cross-orbit dot similarity, and we have already selected the best set $\{c_1^*, \ldots, c_m^*\}$ that minimizes the expected dot similarities across orbits.

As the in-orbit dot similarity increases (numerator) and the cross-orbit dot similarity decreases (denominator), the training loss is reduced by decomposing a separate subspace $V_2$ affected only by the action of $G_2$. Furthermore, note that a linear projector is used in SSL to project the features into lower dimensions, and a linear classifier weight is used in SL. To isolate the effect of $G_2$ to maximize the similarity of in-orbit samples (numerator) and exploit the action of $G_1$ to minimize the similarity of cross-orbit samples (denominator), the effects of $G_1$ and $G_2$ on the feature must be separable by a linear layer, i.e., decomposable. Combined with the earlier proof that $V_2$ is only affected by the action of $G_2$, without loss of generality, we have the decomposition $V = V_1 \oplus V_2$ affected by $G_1$ and $G_2$, respectively.
Proof of Non-Decomposability. We will show that for a representation with a decomposed feature subspace, there exists a non-decomposable representation that achieves the same expected dot similarity, hence having the same contrastive loss. Without loss of generality, consider two groups $G_1, G_2$ acting on the semantic attribute spaces $\mathcal{A}_1, \mathcal{A}_2$, respectively. Let $\phi$ be a decomposable representation such that there exist feature subspaces $V_1, V_2$ affected only by the actions of $G_1, G_2$, respectively, and write the feature of a sample as $z = [z_1, z_2]$. Now define a non-decomposable representation that mixes the two subspaces by an orthogonal map, e.g., $z' = \frac{1}{\sqrt{2}}[z_1 - z_2, z_1 + z_2]$, so that both output subspaces are affected by the actions of both groups. For any pair of samples with features $z$ and $\tilde{z}$, the dot similarity is preserved: $z'^\top \tilde{z}' = \frac{1}{2}(z_1 - z_2)^\top(\tilde{z}_1 - \tilde{z}_2) + \frac{1}{2}(z_1 + z_2)^\top(\tilde{z}_1 + \tilde{z}_2) = z_1^\top \tilde{z}_1 + z_2^\top \tilde{z}_2 = z^\top \tilde{z}$. Therefore, the decomposed and non-decomposed representations yield the same expected dot similarity over all pairs of samples, and have the same contrastive loss.
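A minimal numeric check of this kind of entangling construction (ours): rotating two one-dimensional feature subspaces into each other is an orthogonal map, so all pairwise dot similarities, and hence the contrastive loss, are unchanged, while neither output dimension is affected by only one group:

```python
import math

# Minimal check (ours): rotating two 1-D feature subspaces into each other
# is an orthogonal map, so every pairwise dot similarity (and hence the
# contrastive loss) is unchanged, yet both output dimensions now depend on
# both semantic factors.

def mix(z, theta=math.pi / 4):
    """Entangle z = (z1, z2) by a rotation of angle theta."""
    c, s = math.cos(theta), math.sin(theta)
    return (c * z[0] - s * z[1], s * z[0] + c * z[1])

dot = lambda u, v: sum(a * b for a, b in zip(u, v))

# Decomposed features: dim 0 carries semantic A, dim 1 carries semantic B.
z1, z2 = (0.3, -1.2), (0.8, 0.5)

assert math.isclose(dot(mix(z1), mix(z2)), dot(z1, z2))
assert math.isclose(dot(mix(z1), mix(z1)), dot(z1, z1))
```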
Corollary 1. This follows immediately from the proof of equivariance.
Corollary 2. The same proof above holds for the SL case, with the key space being the set of classifier weights. In this view, each sample in a class can be seen as an augmented view (augmented by shared attributes such as view angle, pose, etc.) of the class prototype. In downstream learning, the shared attributes are not discriminative, hence the performance is affected mostly by the class-related feature subspace. For example, if the groups corresponding to "species" and "shape" act on the same feature subspace (entangled), such that the "species"="bird" feature always comes with a "shape"="streamlined" feature, this representation does not generalize to downstream tasks of classifying birds without a streamlined shape (e.g., "kiwi").
Corollary 3. In SSL and SL, the model essentially receives supervision on attributes that are not discriminative towards downstream tasks, through augmentations and in-class variations, respectively. The group acting on the semantic space of these attributes hence determines the amount of supervision received. With a larger such group, the model filters out more irrelevant semantics and more accurately describes the differences between classes. Note that the standard image augmentations in SSL are also used in SL, making this group even larger in SL.
Corollary 4. When the number of samples in some orbit(s) is too small, there are two consequences that prevent disentanglement: 1) equivariance is not guaranteed, as the training samples do not fully describe the group action; 2) decomposability is not guaranteed, as the decomposed subspace in the previous proof only generalizes to the seen combinations of attribute values.
B.3 Proof of Theorem 1
We will first revisit Invariant Risk Minimization (IRM). Let $\mathcal{X}$ be the image space, $\mathcal{Z}$ the feature space, $\mathcal{Y}$ the classification output space (e.g., the set of all probabilities of belonging to each class), $\Phi: \mathcal{X} \to \mathcal{Z}$ the feature extractor backbone and $w: \mathcal{Z} \to \mathcal{Y}$ the classifier. Let $\mathcal{E}$ be a set of training environments, where each $e \in \mathcal{E}$ is a set of images. IRM aims to solve the following optimization problem:

(5) $\quad \min_{\Phi, w} \sum_{e \in \mathcal{E}} R^e(w \circ \Phi) \quad \text{subject to} \quad w \in \arg\min_{\bar{w}} R^e(\bar{w} \circ \Phi), \;\; \forall e \in \mathcal{E},$

where $R^e$ is the empirical classification risk in the environment $e$ using backbone $\Phi$ and classifier $w$. Conceptually, IRM aims to find a representation $\Phi$ such that the optimal classifier on top of $\Phi$ is the same for all environments. As Eq. (5) is a challenging, bi-leveled optimization problem, it is instantiated into the practical version:

(6) $\quad \min_{\Phi} \sum_{e \in \mathcal{E}} R^e(\Phi) + \lambda \left\| \nabla_{w \mid w=1.0} R^e(w \cdot \Phi) \right\|^2,$

where $\lambda$ is the regularizer balancing between the ERM term and the invariance term.
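A simplified sketch of the practical penalty in Eq. (6) (ours, not the paper's implementation; it assumes a scalar feature, squared loss, and the analytic gradient w.r.t. the dummy classifier $w = 1.0$): an invariant feature incurs zero penalty across environments, while a spurious one does not:

```python
# Simplified sketch (ours, not the paper's implementation): the IRMv1
# penalty of Eq. (6) with a scalar "dummy" classifier w = 1.0 and squared
# loss. Here R^e(w) = mean((w * z - y)^2) over environment e, so the
# gradient at w = 1 is mean(2 * (z - y) * z), computed analytically.

def grad_at_w1(features, labels):
    n = len(features)
    return sum(2 * (z - y) * z for z, y in zip(features, labels)) / n

def irm_penalty(environments):
    """Sum over environments of the squared risk gradient at w = 1.0."""
    return sum(grad_at_w1(z, y) ** 2 for z, y in environments)

# An invariant feature predicts y identically in every environment ...
invariant = [([1.0, 2.0], [1.0, 2.0]), ([3.0, 4.0], [3.0, 4.0])]
# ... while a spurious feature would need a different optimal w per
# environment, so its gradient at the shared w = 1.0 cannot vanish.
spurious = [([1.0, 2.0], [2.0, 4.0]), ([1.0, 2.0], [0.5, 1.0])]

assert irm_penalty(invariant) == 0.0
assert irm_penalty(spurious) > 0.0
```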
The above IRM is formulated for supervised training. In SSL, there is no classifier mapping from the feature space to the output space. Instead, there is a projector network mapping features to another feature space, and Eq. (1) is used to compute the similarity with the positive key (numerator) and negative keys (denominator) in that space. Note that the projector in SSL is not equivalent to the classifier in SL, as the projector itself does not generate the probability output like the classifier does; rather, the comparison between positive and negative keys does.
In fact, the formulation of contrastive IRM is given by Corollary 2 of Lemma 1, which says that SL is a special case of contrastive learning, where the set of all classifier weights is the positive and negative key space. In IRM with SL, we are trying to find a set of weights from the classifier weight space (e.g., $\mathbb{R}^{d \times c}$ with feature dimension $d$ and number of classes $c$) that achieves invariant prediction. Hence in IRM with SSL, we are trying to find a set of keys from the key space (e.g., $\mathbb{R}^{d' \times k}$ with $d'$ being the projected feature dimension and $k$ the number of positive and negative keys) that achieves invariant prediction by differentiating a sample from the negative keys (note that the similarity with positive keys is maximized and fixed by standard SSL training, which decomposes augmentations and other semantics as in Lemma 1). Specifically, in IP-IRM, the 2 subsets in each partition form the set of training environments $\mathcal{E}$.
Proof of the Sufficient Condition. Suppose that the representation is fully disentangled w.r.t. . By Definition 1, there exists subspace affected only by the action of . For each partition given by