If you are reading this paper on a computer, press one of the arrow keys… all the text you are reading jumps as the screen refreshes in response to your action. Now imagine you were playing a video game like Atari’s Space Invaders—the same keystroke would cause a small sprite at the bottom of your screen to move in response. These actions induce changes in pixels that are very different, but in both cases, the visual feedback in response to our actions indicates the presence of some object on the screen—a virtual paper and a virtual spacecraft, respectively—with properties that we can manipulate. Our keystrokes induce sparse changes to a program’s state, and these changes are reflected on the screen, albeit not necessarily in a correspondingly sparse way (e.g., most pixels change when scrolling). Similarly, many of our interactions with the real world induce sparse changes to the underlying causal factors of our environment: lift a coffee cup and the cup moves, but not the rest of the objects on your desk; turn your head laterally, and the coordinates of all the objects in the room shift, but only in the horizontal direction. These examples hint at the main question we aim to answer in this paper: if we know that actions have sparse effects on the latent factors of our system, can we use that knowledge as weak supervision to help disentangle these latent factors from pixel-level data?
on the ImageNet challenge (Deng et al., 2009) without using any of the labels.
Yet, despite these advances, these systems are still far from human reasoning abilities and often fail on out-of-distribution examples (Geirhos et al., 2020). To robustly generalize out of distribution, we need models that can infer the causal mechanisms that relate latent variables (Schölkopf et al., 2021; Schölkopf and von Kügelgen, 2022) because these mechanisms are invariant under distribution shift. The field of causal inference has developed theory and methods to infer causal mechanisms from data (Pearl, 2009; Peters et al., 2017), but these methods assume access to high-level abstract features, instead of low-level signal data such as video, text and images. We need representation learning methods that reliably recover these abstract features if we are to bridge the gap between causal inference and deep learning.
This is a challenging task because the problem of inferring latent variables is not identified with independent and identically distributed (IID) data (Hyvärinen and Pajunen, 1999; Locatello et al., 2019), even in the limit of an infinite number of such IID examples. However, there has been significant recent progress in developing representation learning approaches that provably recover latent factors $z$ (e.g., object positions, object colors, etc.) underlying complex data $x$ (e.g., images), where $x = g(z)$, by going beyond the IID setting and using observations of $x$ along with minimal domain knowledge and supervision (Hyvarinen and Morioka, 2016, 2017; Locatello et al., 2020; Khemakhem et al., 2020a). These works establish provable identification of latents by leveraging strong structural assumptions such as independence conditional on auxiliary information (e.g., timestamps). In this work, we aim to relax these distributional assumptions on the latent variables to achieve identification for arbitrary continuous latent distributions. Instead of distributional assumptions, we assume access to data generated under sparse perturbations that change only a few latent variables at a time as a source of weak supervision. Figure 1 illustrates our working example of this assumption: a simple environment where an agent’s actions perturb the coordinates of a few balls at a time. Our main contributions are summarized as follows.
- We show that sparse perturbations that impact one latent at a time are sufficient to learn the latents (up to permutation and scaling) that follow any unknown continuous distribution.
- Next, we consider more general settings, where perturbations impact one block of latent variables at a time. In the setting where blocks do not overlap, we recover the latents up to an affine transformation of these blocks.
- Further, we show that when perturbation blocks overlap, we get stronger identification. In this setting, we prove identification up to affine transformation of the smallest intersecting block. Consequently, if there are blocks that intersect in one latent variable only, then such latents are identified up to permutation and scaling.
- We leverage these results to propose a natural estimation procedure and experimentally illustrate the theoretical claims on low-dimensional synthetic and high-dimensional image-based data.
2 Related works
Many of the works on provable identification of representations trace their roots to non-linear ICA (Hyvärinen and Pajunen, 1999). Hyvarinen and Morioka (2016, 2017) were the first to use auxiliary information in the form of timestamps and additional structure on the latent evolution to achieve provable identification. Since then, these works have been generalized in many exciting ways. Khemakhem et al. (2020a) assume independence of latents conditional on auxiliary information, and several of these assumptions were further relaxed by Khemakhem et al. (2020b).
Our work builds on the machinery we developed in Ahuja et al. (2022). There we showed that if we know the mechanisms that drive the evolution of latents, then the latents are identified up to equivariances of these mechanisms. However, we left the question of achieving exact identification without such knowledge open. Here we consider a class of mechanisms where an agent’s actions impact the latents through unknown perturbations. We show how to achieve identification by exploiting the sparsity in the perturbations. This class of perturbations was first leveraged to prove identification by Locatello et al. (2020). However, Locatello et al. assume that the latents are independent, whereas we make no assumptions on the distribution other than continuity. Our work also connects to an insightful line of work on multi-view ICA (Gresele et al., 2020). Gresele et al. assume independence of latents and prove identification under multiple views of the same latent through multiple decoders.
Klindt et al. (2021) and Lachapelle et al. (2022) exploit different forms of sparsity in time-series settings to attain identification. Both works require assumptions on the parametric form of the latents (e.g., Laplacian, conditional exponential), auxiliary information observed (e.g., actions, timestamp), and the structure of the graphical model dictating the interactions between the latents and auxiliary information to arrive at identification. Yao et al. (2021) and Lippe et al. (2022) model the latent evolution as a structural causal model unrolled in time. Yao et al. exploit non-stationarity and sufficient variability dictated by the auxiliary information to provide identification guarantees. Lippe et al. exploit causal interventions on the latents to provide identification guarantees but require the knowledge of intervention targets and assume an invariant causal model describing the relations between any adjacent time frames. In concurrent work, Brehmer et al. (2022) leverage data generated under causal interventions as a source of weak supervision and prove identification for structural causal models that are diffeomorphic transforms of exogenous noise. In addition to the above, there are a number of recent papers that explain the success of self-supervised contrastive learning through the lens of identification of representations. Zimmermann et al. (2021) showed that encoders minimizing contrastive losses identify the latents generated from distributions such as the von Mises-Fisher distribution. Von Kügelgen et al. (2021) depart from the distributional assumptions made by Zimmermann et al. (2021) and show that data augmentations filter out “nuisance” from the semantically relevant content to achieve blockwise identification.
3 Latent identification under sparse perturbations
Data Generation Process
We start by describing the data generation process used for the rest of the work. There are two classes of variables we consider: (a) unobserved latent variables $z \in \mathbb{R}^d$ and (b) observed variables $x \in \mathbb{R}^n$. The latent variables are sampled from a distribution $P_Z$ and then transformed by a map $g: \mathbb{R}^d \to \mathbb{R}^n$, where $g$ is injective and analytic, to generate $x$. (Footnote 2: An analytic function $g$ is an infinitely differentiable function such that for all $z'$ in its domain, the Taylor series evaluated at $z'$ converges pointwise to $g(z')$.) We write this as follows

$$z \sim P_Z, \qquad x = g(z), \tag{1}$$

where $z$ and $x$ are realizations of the random variables $Z$ and $X$ respectively. It is impossible to invert $g$ just from the realizations of $X$ (Hyvärinen and Pajunen, 1999; Locatello et al., 2019). Most work has gone into understanding how the structure of latents and auxiliary information (e.g., timestamps, weak labels) play a role in solving the above problem. In this work, we depart from these assumptions and instead investigate the role of data generated under perturbations of latents to achieve identification. Define the set of perturbations as $\mathcal{I} = \{1, \dots, m\}$ and the corresponding perturbation vectors as $\Delta = \{\delta_1, \dots, \delta_m\}$, where $\delta_i \in \mathbb{R}^d$ is the $i$-th perturbation. Each latent $z$ is sampled from an arbitrary and unknown distribution $P_Z$. The same set of unknown perturbations in $\Delta$ is applied to each $z$ to generate $m$ perturbed latents $\tilde{z}_k = z + \delta_k$ per sampled $z$ and the corresponding observed vectors $\tilde{x}_k$. Each of these latents is transformed by the map $g$ and we observe $(x, \tilde{x}_1, \dots, \tilde{x}_m)$. Our goal is to use these observations and estimate the underlying latents. We summarize this data generation process (DGP) in the following assumption.

Assumption 1. The DGP follows

$$z \sim P_Z, \quad x = g(z), \quad \tilde{z}_k = z + \delta_k, \quad \tilde{x}_k = g(\tilde{z}_k), \quad \forall k \in \mathcal{I}, \tag{2}$$

where $g: \mathbb{R}^d \to \mathbb{R}^n$ is injective and analytic, and $Z$ is a continuous random vector with full support over $\mathbb{R}^d$. (Footnote 3: The assumption on the support of $Z$ can be relaxed.)
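The data generation process above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the mixing map (an elementwise `sinh` applied after a near-identity linear map `W`), the Gaussian choice of latent distribution, and the names `sample_dgp`, `W`, and `deltas` are all hypothetical stand-ins and are not taken from the paper.

```python
import numpy as np

def sample_dgp(num_samples, deltas, g, rng):
    """Sample z, apply every perturbation in `deltas`, and map through g."""
    m, d = deltas.shape
    z = rng.normal(size=(num_samples, d))            # stand-in for an unknown continuous P_Z
    z_tilde = z[:, None, :] + deltas[None, :, :]     # (N, m, d): z + delta_k for every k
    x = g(z)                                         # (N, n) unperturbed observations
    x_tilde = g(z_tilde.reshape(-1, d)).reshape(num_samples, m, -1)
    return z, x, z_tilde, x_tilde

rng = np.random.default_rng(0)
d = 4
# Hypothetical mixing map: sinh is analytic and injective; W is close to the identity.
W = np.eye(d) + 0.1 * rng.normal(size=(d, d))
g = lambda z: np.sinh(z @ W.T)
# One-sparse perturbations: row k changes only latent k.
deltas = np.diag(rng.uniform(0.5, 1.5, size=d))
z, x, z_tilde, x_tilde = sample_dgp(8, deltas, g, rng)
print(x.shape, x_tilde.shape)
```

Each sample is paired with all of its perturbed versions, which matches the setting where every perturbation is applied to every latent.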
To better understand the above DGP, let us turn to some examples. Consider a setting where an agent is interacting with an environment containing several balls (see Figure 1). The latent $z$ captures the properties of the objects; for example, in Figure 1, $z$ just captures the positions of each ball, but in general it could include more properties such as velocity, shape, color, etc. The agent perturbs the objects in the scene by $\delta_k$, which can modify a single property associated with one object or multiple properties from one or more objects depending on how the agent acts. Note that when the agent perturbs a latent, it can lead to downstream effects. For instance, if the agent moves a ball to the edge of the table, the ball falls in subsequent frames. For this work, we only consider the observations just before and after the perturbation and not the downstream effects. In the Appendix, we explain these downstream effects using structural causal models (see Section 7.2). We also explain the connection between the perturbations in equation (2) (based on Locatello et al. (2020)) and causal interventions. The above example is typical of a reinforcement learning environment; other examples include natural videos with sparse changes (e.g., MPI3D data (Gondal et al., 2019)).
In the above DGP in equation (2), we assumed that for each scene there are multiple perturbations. It is possible to extend our results to settings where we perturb each scene only once, given a sufficiently diverse set of perturbations, i.e., for a small neighborhood of a scene around $z$, each scene in the neighborhood receives a different perturbation. We compare these two approaches experimentally.
The learner’s objective is to use the observed samples $(x, \tilde{x}_1, \dots, \tilde{x}_m)$ generated by the DGP in Assumption 1 and learn an encoder $f: \mathbb{R}^n \to \mathbb{R}^d$ that inverts the function $g$ and recovers the true latents. For each observed sample, the learner compares all the pairs $(x, \tilde{x}_k)$ pre- and post-perturbation. For every unknown perturbation $\delta_k$ used in the DGP in equation (2), the learner guesses the perturbation $\delta'_k$ and enforces that the latents predicted by the encoder for $x$ and $\tilde{x}_k$ are consistent with the guess. We write this as

$$f(\tilde{x}_k) = f(x) + \delta'_k, \quad \forall k \in \mathcal{I}, \ \forall (x, \tilde{x}_k) \text{ generated by the DGP in (2)}. \tag{3}$$

We denote the set of guessed perturbations as $\Delta' = \{\delta'_1, \dots, \delta'_m\}$, where $\delta'_k$ is the guess for perturbation $\delta_k$. We can turn the above identity into a mean square error loss given as

$$\min_{f, \Delta'} \ \mathbb{E}\Big[\sum_{k \in \mathcal{I}} \big\| f(\tilde{x}_k) - f(x) - \delta'_k \big\|^2\Big], \tag{4}$$

where the expectation is taken over observed samples generated by the DGP in (2) and the minimization is over all the possible maps $f$ and perturbation guesses in the set $\Delta'$. Note that a trivial solution to the above problem is an encoder that maps everything to zero, and all guesses equal zero. In the next section, we get rid of these trivial solutions by imposing an additional requirement that the span of the set $\Delta'$ is $\mathbb{R}^d$. It is worth pointing out that we do not restrict the set of $f$’s to injective maps in theory and experiments. We denote the latent estimated by the encoder for a point $x$ as $\hat{z} = f(x)$. It is related to the true latent as follows: $\hat{z} = a(z)$, where $a = f \circ g$ is the function that relates true $z$ to estimated $\hat{z}$. In the next section, we show that if perturbations are diverse, then $a$ is an affine transform. Further, we show that if perturbations are sparse, then $a$ takes an even simpler form.
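To make the loss concrete, the sketch below evaluates it for fixed candidate encoders and guesses on toy data. It is a minimal illustration, not the training procedure used in the experiments: the mixing map, its inverse `g_inv`, the perturbation values, and the function name `perturbation_loss` are assumptions introduced here. The example also shows why the trivial all-zero solution has to be excluded: it attains zero loss just like the true inverse does.

```python
import numpy as np

def perturbation_loss(f, x, x_tilde, guesses):
    """Mean over samples of sum_k ||f(x_tilde_k) - f(x) - guess_k||^2."""
    num, m, _ = x_tilde.shape
    z_hat = f(x)                                                        # (N, d)
    z_hat_tilde = f(x_tilde.reshape(num * m, -1)).reshape(num, m, -1)   # (N, m, d)
    residual = z_hat_tilde - z_hat[:, None, :] - guesses[None, :, :]
    return float(np.mean(np.sum(residual ** 2, axis=(1, 2))))

rng = np.random.default_rng(0)
d = 3
W = np.eye(d) + 0.1 * rng.normal(size=(d, d))          # hypothetical mixing weights
g = lambda z: np.sinh(z @ W.T)                         # toy injective, analytic map
g_inv = lambda x: np.arcsinh(x) @ np.linalg.inv(W).T   # its exact inverse

deltas = np.diag([1.0, -0.5, 2.0])                     # true (normally unknown) perturbations
z = rng.normal(size=(16, d))
x = g(z)
x_tilde = g((z[:, None, :] + deltas[None, :, :]).reshape(-1, d)).reshape(16, d, d)

loss_true = perturbation_loss(g_inv, x, x_tilde, deltas)
loss_trivial = perturbation_loss(lambda v: np.zeros((len(v), d)), x, x_tilde, np.zeros((d, d)))
loss_wrong = perturbation_loss(g_inv, x, x_tilde, 2.0 * deltas)
print(loss_true, loss_trivial, loss_wrong)
```

Doubling the guesses leaves a residual of one perturbation per pair, so `loss_wrong` equals the summed squared norm of the perturbations, while the first two losses vanish up to floating-point error.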
3.1 Sparse perturbations
We first show that it is possible to identify the true latents up to an affine transformation without any sparsity assumptions. Later, we leverage sparsity to strengthen identification guarantees.
Assumption 2. The dimension of the span of the perturbations in equation (2) is $d$, i.e., $\dim(\mathrm{span}(\Delta)) = d$.
The above assumption implies that the perturbations are diverse. We now state a regularity condition on the function $a$.
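Assumption 3. $a$ is an analytic function. For each component $i \in \{1, \dots, d\}$ of $a(z)$ and each component $j \in \{1, \dots, d\}$ of $z$, define the set $S^{ij} = \{\theta \mid \nabla_j a_i(z + b) = \nabla_j a_i(z) + \nabla(\nabla_j a_i)(\theta)^\top b, \ z \in \mathbb{R}^d\}$, where $b$ is a fixed vector in $\mathbb{R}^d$. Each set $S^{ij}$ has a non-zero Lebesgue measure in $\mathbb{R}^d$.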
If we restrict the encoder $f$ to be analytic, then $a = f \circ g$ is analytic since $g$ is also analytic, thus satisfying the first part of the above assumption. The second part of the above assumption can be understood as follows: suppose we have a scalar-valued function $h: \mathbb{R}^d \to \mathbb{R}$ that is differentiable. If we expand $h(z + b)$ around $z$, by the mean value theorem we get $h(z + b) = h(z) + \nabla h(\theta)^\top b$, where $\theta$ lies on the line segment between $z$ and $z + b$. If we vary $z$ to take all the values in $\mathbb{R}^d$, then $\theta$ also varies. The above assumption states that the set of $\theta$ has a non-zero Lebesgue measure. Under the above assumptions, we show that an encoder that solves equation (3) identifies true latents up to an affine transform, i.e., $\hat{z} = Az + c$, where $A \in \mathbb{R}^{d \times d}$ is a matrix and $c \in \mathbb{R}^d$ is an offset.
The proof of the above proposition follows the proof technique from Ahuja et al. (2022); for further details, refer to the Appendix (Section 7.1). We interpret the above result in the context of the agent interacting with balls (as shown in Figure 1), where the latent vector $z$ captures the $x$ and $y$ coordinates of the balls. Under each perturbation, the balls move along the vector dictated by the perturbation. If there are at least $d$ perturbations that span the latent space, then the latents estimated by the learned encoder are guaranteed to be an affine transformation of the actual positions of the balls.
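As a sanity check on the easy direction of this result, the short derivation below verifies that any affine map of the true latents satisfies the identity in equation (3). It is only a sketch: the symbols $A$, $c$, $\delta_k$ and $\delta'_k$ follow the notation assumed in this section, and the harder converse (that only affine maps satisfy the identity) is what the proposition establishes.

```latex
\begin{align*}
\hat{z} = f(x) &= Az + c, \qquad \tilde{x}_k = g(z + \delta_k) \\
f(\tilde{x}_k) - f(x) &= \big(A(z + \delta_k) + c\big) - \big(Az + c\big) \\
&= A\,\delta_k \;=:\; \delta'_k .
\end{align*}
```

The guesses $\delta'_k = A\delta_k$ do not depend on $z$, and they span $\mathbb{R}^d$ whenever $A$ is invertible and the true perturbations span $\mathbb{R}^d$.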
3.1.1 Non-overlapping perturbations
Assumption 4. The perturbations in $\Delta$ are one-sparse, i.e., each $\delta_k \in \Delta$ has one non-zero component.
Next, we show that under one-sparse perturbations, the latents estimated identify true latents up to permutation and scaling.
Theorem 1. If Assumptions 1-4 hold and the number of perturbations per example equals the latent dimension, $m = d$ (footnote 4: this condition can be relaxed; refer to the Appendix (Section 7.2) for details), then the encoder that solves equation (3) (with $\Delta'$ one-sparse and $\dim(\mathrm{span}(\Delta')) = d$) identifies true latents up to permutation and scaling, i.e., $\hat{z} = \Pi \Lambda z + c$, where $\Lambda$ is an invertible diagonal matrix, $\Pi$ is a permutation matrix and $c$ is an offset.
For the proof of the above theorem, refer to Section 7.1 in the Appendix. The theorem does not require the learner to know either the identity of the perturbed component or the amount by which it changed. However, the learner has to use one-sparse perturbations as guesses. Suppose the learner does not know that the actual perturbations are one-sparse and instead uses guesses that are $p$-sparse, i.e., $p$ latents change at one time. In such a case, the estimated $\hat{z}$ and true $z$ are related to each other through a permutation and block diagonal matrix, i.e., we can replace $\Lambda$ in the above result with a block diagonal matrix instead of a diagonal matrix (see Section 7.2 in the Appendix for details). In the context of the ball agent interaction environment from Figure 1, the above result states that provided the agent interacts with one coordinate of each ball at a time, it is possible to learn the position of each ball up to scaling errors.
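The intuition behind the theorem can be checked numerically. Under an affine relation between estimated and true latents with matrix $A$, the identity in equation (3) forces each guess to equal $A$ applied to the true perturbation. The sketch below, with hypothetical helper names `is_one_sparse` and `is_permutation_scaling` and randomly drawn matrices chosen only for illustration, shows that a permutation-and-scaling matrix keeps one-sparse perturbations one-sparse, whereas a generic dense matrix does not.

```python
import numpy as np

def is_one_sparse(v, tol=1e-9):
    """True if exactly one entry of v is non-zero (up to tol)."""
    return int(np.sum(np.abs(v) > tol)) == 1

def is_permutation_scaling(A, tol=1e-9):
    """True if every row and every column of A has exactly one non-zero entry."""
    mask = np.abs(A) > tol
    return bool(np.all(mask.sum(axis=0) == 1) and np.all(mask.sum(axis=1) == 1))

rng = np.random.default_rng(0)
d = 4
deltas = np.diag(rng.uniform(0.5, 1.5, size=d))      # row k is a one-sparse perturbation
perm_scale = np.eye(d)[rng.permutation(d)] @ np.diag(rng.uniform(1.0, 2.0, size=d))
dense = rng.normal(size=(d, d))                      # generic entangling matrix

ok_guesses = deltas @ perm_scale.T                   # row k equals perm_scale @ delta_k
bad_guesses = deltas @ dense.T                       # row k equals dense @ delta_k

print(all(is_one_sparse(row) for row in ok_guesses))
print(any(is_one_sparse(row) for row in bad_guesses))
print(is_permutation_scaling(perm_scale), is_permutation_scaling(dense))
```

Requiring the guesses to be one-sparse therefore rules out the dense matrix, which is the mechanism that tightens affine identification to permutation and scaling.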
We now consider a natural extension of the setting above, where the perturbations simultaneously operate on blocks of latents. In the ball agent interaction environment, this can lead to multiple scenarios – i) the agent interacts with one ball at a time but perturbs both coordinates simultaneously, ii) the agent interacts with several balls simultaneously.
Consider a perturbation $\delta_k \in \Delta$ (from equation (2)). We define the block of latents that is impacted under perturbation $\delta_k$ as $\{ j \mid \delta_k^j \neq 0 \}$, where $\delta_k^j$ is the $j$-th component of $\delta_k$. We group the perturbations in $\Delta$ based on the block they act upon, i.e., perturbations in the same group act on the same block of latents. Define the set of the groups corresponding to perturbations in $\Delta$ as $\mathcal{G} = \{G_1, \dots, G_q\}$. Define the set of corresponding blocks as $\mathcal{B} = \{B_1, \dots, B_q\}$, where $B_l$ is the block impacted by perturbations in group $G_l$. If $\mathcal{B}$ partitions the set of latent components indexed $\{1, \dots, d\}$, then it implies all the distinct blocks are non-overlapping. We formally define this below.
Definition 1. Blockwise and non-overlapping perturbations. If the set of blocks $\mathcal{B}$ corresponding to perturbations $\Delta$ forms a partition of $\{1, \dots, d\}$, then $\Delta$ is said to be blockwise and non-overlapping. Formally stated, any two distinct $B_i, B_j \in \mathcal{B}$ do not intersect, i.e., $B_i \cap B_j = \emptyset$, and $\bigcup_{l} B_l = \{1, \dots, d\}$.
From the above definition it follows that two perturbations either act on the same block or completely different blocks with no overlapping variables.
Assumption 6. The learner knows the group label for each perturbation $\delta_k \in \Delta$. Therefore, any two perturbations associated with the same group in $\mathcal{G}$ impact the same block of latents.
We illustrate the above Assumptions 5, 6 in the following example. Consider the ball agent interaction environment (Figure 1). $z = (x_1, y_1, x_2, y_2, \dots)$ is the vector of positions of all balls, where $(x_j, y_j)$ is the coordinate of ball $j$. If the agent randomly perturbs ball $j$, then it changes the block $(x_j, y_j)$. We would call such a system $2$-sparse. All the perturbations on ball $j$ are in one group. Since the agent knows only the group of each perturbation, it does not know the ball index, but it does know when it interacts with the same ball.
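Definition 2. If the recovered latent variables satisfy $\hat{z} = \Pi \tilde{\Lambda} z + c$, where $\Pi$ is a permutation matrix, $\tilde{\Lambda}$ is a block-diagonal matrix and $c$ is an offset, then the latent variables are said to be recovered up to permutations and block-diagonal transforms.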
In the theorem that follows, we show that under the assumptions made in this section, we achieve identification up to permutations and block-diagonal transforms with invertible blocks.
For the proof of the above theorem, refer to Section 7.1 in the Appendix. From the above theorem, we gather that the learner can separate the perturbed blocks. However, the latent dimensions within the block are linearly entangled. In the ball agent interaction with $2$-sparse perturbations, the above theorem implies that the agent can separate each ball out but not their respective $x$ and $y$ coordinates. In the above theorem, we require the learner to know the group of each intervention (Assumption 6). In Section 7.2 in the Appendix, we discuss ideas on how to relax this assumption.
3.1.2 Overlapping perturbations
In the previous section, we assumed that the blocks across different perturbations are non-overlapping. This section relaxes this assumption and allows the perturbation blocks to overlap. We start with a motivating example to show how overlapping perturbations can lead to stronger identification.
Consider the agent interacting with two balls, where $z = (x_1, y_1, x_2, y_2)$ describes the coordinates of the two balls. The agent perturbs the first ball and then perturbs the second ball. For the purpose of this example, assume that these perturbations satisfy the assumptions in Theorem 2. We obtain that the estimated position of each ball is linearly entangled w.r.t. the true $x$ and $y$ coordinates. For the first ball we get $\hat{x}_1 = a_1 x_1 + a_2 y_1 + a_3$. We also have the agent perturb the $x$ coordinates of the first and second ball together and then it does the same with the $y$ coordinates. We apply Theorem 2 and obtain that the estimated $x$ coordinates of each ball are linearly entangled. We write this as $\hat{x}_1 = b_1 x_1 + b_2 x_2 + b_3$. We take a difference of the two relations for $\hat{x}_1$ to get

$$(a_1 - b_1)\, x_1 + a_2\, y_1 - b_2\, x_2 + (a_3 - b_3) = 0.$$

Since the above has to hold for all $(x_1, y_1, x_2)$, we get $a_1 = b_1$, $a_2 = b_2 = 0$, and $a_3 = b_3$. Thus $\hat{x}_1 = a_1 x_1 + a_3$. Similarly, we can disentangle the rest of the balls.
We take the insights from the above example and generalize them below. Let us suppose that from the set of perturbations $\Delta$ we can construct at least two distinct subsets $\Delta_1$ and $\Delta_2$ such that both subsets form a blockwise non-overlapping perturbation (see Definition 1). Perturbations in $\Delta_1$ ($\Delta_2$) partition $\{1, \dots, d\}$ into blocks $\mathcal{B}_1$ ($\mathcal{B}_2$) respectively. It follows that there exist at least two blocks $B_1 \in \mathcal{B}_1$ and $B_2 \in \mathcal{B}_2$ such that $B_1 \cap B_2 \neq \emptyset$. From Theorem 2, we know that we can identify latents in blocks $B_1$ and $B_2$ up to affine transforms. In the next theorem, we show that we can identify latents in each of the blocks $B_1 \cap B_2$, $B_1 \setminus B_2$, and $B_2 \setminus B_1$ up to affine transforms.
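Each perturbation in $\Delta$ is $p$-sparse. The perturbations in each group span a $p$-dimensional space, i.e., $\dim(\mathrm{span}(\{\delta_k\}_{k \in G_l})) = p$ for each group $G_l$. There exist at least two distinct subsets of perturbations $\Delta_1$ and $\Delta_2$ that are both blockwise and non-overlapping.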
For the proof of the above theorem, refer to Section 7.1 in the Appendix. From the above theorem, it follows that if blocks overlap at one latent only, then all such latents are identified up to permutation and scaling. We now construct an example to show the identification of all the latents under overlapping perturbations. Suppose we have a $4$-dimensional latent. The set of all contiguous blocks of length $2$ is given as follows: $\{\{1,2\}, \{2,3\}, \{3,4\}, \{4,1\}\}$. Different $2$-sparse perturbations impact these blocks. Observe that every component between $1$ and $4$ gets to be the first element of a block exactly once and the last element of a block exactly once. As a result, each latent gets to be the only element at the intersection of two blocks. We apply Theorem 3 to this case and get that all the latents are identified up to permutation and scaling. We generalize this example below.
$\mathcal{B}$ is the set of all the contiguous blocks of length $p$ (with wrap-around, so that there are $d$ blocks). The perturbations in each block span a $p$-dimensional space. Further, assume that $d$ is divisible by $p$, i.e., $d \bmod p = 0$.
In the above assumption, we construct $d$ contiguous blocks, each of length $p$. The construction ensures that each index in $\{1, \dots, d\}$ forms the first element of exactly one block and the last element of exactly one block. In the next theorem, we show that under the above assumption, we achieve identification up to permutation and scaling.
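The block construction can be checked with a few lines of plain Python. The sketch below assumes the blocks wrap around cyclically and uses a latent dimension of 4 with block length 2 purely as an illustrative choice; the helper names `contiguous_blocks`, `sole_intersections`, and `partitions` are hypothetical. It verifies that every index is the sole element in the intersection of two blocks, and that the blocks split into non-overlapping partitions when the block length divides the dimension.

```python
def contiguous_blocks(d, p):
    """All cyclically contiguous blocks of length p over indices 1..d."""
    return [frozenset((start + offset) % d + 1 for offset in range(p))
            for start in range(d)]

def sole_intersections(blocks):
    """Indices that are the only common element of some pair of blocks."""
    singles = set()
    for i, first in enumerate(blocks):
        for second in blocks[i + 1:]:
            common = first & second
            if len(common) == 1:
                singles |= common
    return singles

def partitions(blocks, p):
    """Split the blocks into p families; each family is a partition when p divides d."""
    return [blocks[r::p] for r in range(p)]

d, p = 4, 2                      # illustrative values, assumed for this sketch
blocks = contiguous_blocks(d, p)
print([sorted(b) for b in blocks])
print(sorted(sole_intersections(blocks)))
for family in partitions(blocks, p):
    print([sorted(b) for b in family])
```

With these values the two families are the blocks starting at odd positions and the blocks starting at even positions, which play the role of the two blockwise non-overlapping subsets of perturbations.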
Table header (body not recovered): C-wise | C-wise | B-wise | B-wise
For the proof of the above theorem, refer to Section 7.1 in the Appendix. The total number of perturbations required in the above theorem is $p \times d$. If we plug in $p = 1$, we recover Theorem 1 as a special case. The above result highlights that if the block lengths are larger, then we need to scale the number of perturbations accordingly by the same factor to achieve identification up to permutation and scaling. We assumed a special class of perturbations operating on contiguous blocks. In general, the total number of distinct blocks can be up to $\binom{d}{p}$. Suppose $r$ distinct random blocks of length $p$ are selected for perturbations. As $r$ grows, we reach a point where each latent component is at the intersection of two blocks from different sets of blockwise non-overlapping perturbations. At that point, we identify all latents up to permutation and scaling.
Data generation processes
We conducted two sets of experiments, with low-dimensional synthetic and high-dimensional image-based inputs, that follow the DGP in equation (2). In the low-dimensional synthetic experiments we experimented with two choices for the latent distribution. We used an invertible multi-layer perceptron (MLP) from Zimmermann et al. (2021) for $g$. We evaluated several latent dimensions $d$, with fixed-size training and test sets. For the image-based experiments we used the rendering engine of PyGame (Shinners, 2011) for $g$ and generated pixel images that look like those shown in Figure 1. The coordinates of each ball, $(x_i, y_i)$, were drawn independently from a uniform distribution. We varied the number of balls up to 4 ($d = 8$). For these experiments, there was no fixed-size training set; instead the images were generated online and we trained to convergence. Because these problems are high dimensional, we only sampled a single perturbation for each image.
Loss function, architecture, evaluation metrics
In all the experiments we optimized equation (4) with square error loss. The encoder was an MLP with two hidden layers for the low-dimensional synthetic experiments and a ResNet-18 (He et al., 2015) for the image-based experiments. Further training details, such as the optimizers and hyperparameters used, are in the Appendix (Section 7.3). We used the mean correlation coefficient (MCC) (Hyvarinen and Morioka, 2016) to verify the claims in Theorems 1 and 4. If MCC equals one, then the estimated latents identify true latents up to permutation and scaling. We extend MCC to blockwise MCC (BMCC) to verify the claims in Theorem 2. If BMCC equals one, then the estimated latents identify true latents up to permutation and block-diagonal transforms. Further details are in the Appendix (Section 7.3). The code to reproduce these experiments can be found at https://github.com/ahujak/WSRL.
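For readers unfamiliar with the metric, the sketch below shows one common way to compute an MCC score. It is an assumption-laden illustration rather than the evaluation code from the repository: it uses absolute Pearson correlations and a brute-force search over permutations, which is only practical for small latent dimensions, and the function name `mean_correlation_coefficient` is hypothetical.

```python
import itertools
import numpy as np

def mean_correlation_coefficient(z_true, z_hat):
    """Best-matching mean absolute Pearson correlation between latent columns."""
    d = z_true.shape[1]
    # Cross-correlation block between true (rows) and estimated (columns) latents.
    corr = np.abs(np.corrcoef(z_true.T, z_hat.T)[:d, d:])
    best = max(sum(corr[i, perm[i]] for i in range(d))
               for perm in itertools.permutations(range(d)))
    return float(best / d)

rng = np.random.default_rng(0)
z = rng.normal(size=(1000, 3))

# Permuted, rescaled, and shifted latents: identified up to permutation and scaling.
z_perm_scaled = z[:, [2, 0, 1]] * np.array([2.0, -0.5, 3.0]) + 1.0

# A 45-degree rotation entangles the first two latents.
theta = np.pi / 4
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0,            0.0,           1.0]])
z_mixed = z @ R.T

mcc_good = mean_correlation_coefficient(z, z_perm_scaled)
mcc_mixed = mean_correlation_coefficient(z, z_mixed)
print(mcc_good, mcc_mixed)
```

The first score is one up to floating-point error because permutation, scaling, and offsets do not change absolute correlations, while the rotated latents score noticeably lower.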
We start with results from experiments with one-sparse perturbations. The set $\Delta$ consists of one-sparse perturbations that span a $d$-dimensional space. In the context of the image experiments, these perturbations correspond to moving each ball individually along a single axis. The learner solves the identity in equation (3) using a set $\Delta'$ of random one-sparse perturbations that span a $d$-dimensional space. In Table 1, we used the low-dimensional synthetic data generating process to compare the effect of (i) applying all perturbations to each instance $z$ (following the DGP in (2)), against a more practical setting (ii) where a perturbation is selected uniformly at random from $\Delta$ and applied to each instance $z$. The results for (i) are shown in black and the results for (ii) are shown in gray font in the C-wise (componentwise) column in Table 1. We observed high MCCs in both settings. The results were similar in the more challenging image-based experiments (see Table 3, C-wise column), with high MCC scores for all the settings that we tested, as expected given the results presented in Theorem 1.
Next, we chose the set of perturbations $\Delta$ to comprise $2$-sparse non-overlapping perturbations that span a $d$-dimensional space. We repeated the same synthetic experiments as above, with one perturbation per instance and with all perturbations per instance. Under these assumptions we should expect to see that pairs of latents are separated blockwise but linearly entangled within the blocks (cf. Theorem 2). We found this to be the case. The high BMCC numbers in Table 1 displayed under the B-wise (blockwise) column (with the exception of one configuration with one perturbation per sample) show disentanglement between the blocks of latents. In Figure 2, the first two rows and columns show how the predicted latents corresponding to a block are correlated with their true counterpart and with the other latent in the block. The plots in the last column show that the predicted latents did not bear a correlation with a randomly selected latent from outside the block.
In this section, we experimented with blocks of size two that overlap in order to conform with the setting described in Theorem 4. We used the same distributions as before and only changed the type of perturbations. The low-dimensional synthetic results are summarized in Table 2. The results were largely as expected, with a strong correspondence between the predicted and true latents reflected by high MCC values.
On the image datasets (see Table 3), we found that the MCC scores depended on both the number of balls and how the blocks were selected. We compared two strategies for selecting blocks of latents to perturb: either select uniformly from all adjacent pairs ( blocks), or uniformly from all combinations of latent indices, ( blocks). The latter led to higher MCC scores (ranging from to ) as it placed more constraints on the solution space. The dependence on the number of balls is more surprising. To investigate the implied entanglement from the lower MCC scores, we evaluated trained encoders on images where we kept balls in a fixed location and moved one of the balls (see Section 7.3 in the Appendix for example images). If the coordinates were perfectly disentangled, the encoder should predict no movement for static balls. When the moving ball shifted by units, the predicted location of the static balls shifted by and the moving ball shifted units. We further verified this claim and ran blockwise experiments with balls () and got MCC scores of and for and blocks respectively. In the Appendix (Section 7.3), we show that this solution is a stationary point, and we achieve a perfect MCC of one when .
5 Discussion and limitations
Our work presents the first systematic analysis of the role of sparsity in achieving latent identification under unknown arbitrary latent distributions. We assume that every sample (or at least every neighborhood of a sample) experiences the same set of perturbations. A natural question is how to extend our results to settings where this assumption may not hold. Data augmentation provides a rich source of perturbations; our results cover translations, but extending them to other forms of augmentation is an important future direction. We followed the literature on non-linear ICA (Hyvarinen et al., 2019) and made two assumptions – i) the map that mixes latents is injective, and ii) the dimension of the latent is known. We believe future works should aim to relax these assumptions.
Kartik Ahuja acknowledges the support provided by the IVADO postdoctoral fellowship funding program. Jason Hartford acknowledges the support of the Natural Sciences and Engineering Research Council of Canada (NSERC). The authors also acknowledge the funding from Recursion Pharmaceuticals and CIFAR.
- Ahuja et al. (2022) Ahuja, K., Hartford, J., and Bengio, Y. (2022). Properties from mechanisms: an equivariance perspective on identifiable representation learning. In International Conference on Learning Representations.
- Brehmer et al. (2022) Brehmer, J., De Haan, P., Lippe, P., and Cohen, T. (2022). Weakly supervised causal representation learning. arXiv preprint arXiv:2203.16437.
- Brown et al. (2020) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. (2020). Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901.
- Deng et al. (2009) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. (2009). Imagenet: A large-scale hierarchical image database. In , pages 248–255. Ieee.
- Geirhos et al. (2020) Geirhos, R., Jacobsen, J., Michaelis, C., Zemel, R. S., Brendel, W., Bethge, M., and Wichmann, F. A. (2020). Shortcut learning in deep neural networks. CoRR, abs/2004.07780.
- Gondal et al. (2019) Gondal, M. W., Wuthrich, M., Miladinovic, D., Locatello, F., Breidt, M., Volchkov, V., Akpo, J., Bachem, O., Schölkopf, B., and Bauer, S. (2019). On the transfer of inductive bias from simulation to the real world: a new disentanglement dataset. Advances in Neural Information Processing Systems, 32.
- Gresele et al. (2020) Gresele, L., Rubenstein, P. K., Mehrjou, A., Locatello, F., and Schölkopf, B. (2020). The incomplete rosetta stone problem: Identifiability results for multi-view nonlinear ica. Uncertainty in Artificial Intelligence, pages 217–227. PMLR.
- He et al. (2015) He, K., Zhang, X., Ren, S., and Sun, J. (2015). Deep residual learning for image recognition. CoRR, abs/1512.03385.
- Hyvarinen and Morioka (2016) Hyvarinen, A. and Morioka, H. (2016). Unsupervised feature extraction by time-contrastive learning and nonlinear ica. Advances in Neural Information Processing Systems, 29.
- Hyvarinen and Morioka (2017) Hyvarinen, A. and Morioka, H. (2017). Nonlinear ica of temporally dependent stationary sources. In Artificial Intelligence and Statistics, pages 460–469. PMLR.
- Hyvärinen and Pajunen (1999) Hyvärinen, A. and Pajunen, P. (1999). Nonlinear independent component analysis: Existence and uniqueness results. Neural networks, 12(3):429–439.
- Hyvarinen et al. (2019) Hyvarinen, A., Sasaki, H., and Turner, R. (2019). Nonlinear ica using auxiliary variables and generalized contrastive learning. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 859–868. PMLR.
- Khemakhem et al. (2020a) Khemakhem, I., Kingma, D., Monti, R., and Hyvarinen, A. (2020a). Variational autoencoders and nonlinear ica: A unifying framework. In International Conference on Artificial Intelligence and Statistics, pages 2207–2217. PMLR.
- Khemakhem et al. (2020b) Khemakhem, I., Monti, R., Kingma, D., and Hyvarinen, A. (2020b). Ice-beem: Identifiable conditional energy-based deep models based on nonlinear ica. Advances in Neural Information Processing Systems, 33:12768–12778.
- Kingma and Ba (2014) Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization.
- Klindt et al. (2021) Klindt, D. A., Schott, L., Sharma, Y., Ustyuzhaninov, I., Brendel, W., Bethge, M., and Paiton, D. (2021). Towards nonlinear disentanglement in natural data with temporal sparse coding. In International Conference on Learning Representations.
- Lachapelle et al. (2022) Lachapelle, S., Rodriguez, P., Sharma, Y., Everett, K. E., PRIOL, R. L., Lacoste, A., and Lacoste-Julien, S. (2022). Disentanglement via mechanism sparsity regularization: A new principle for nonlinear ICA. In First Conference on Causal Learning and Reasoning.
- Lippe et al. (2022) Lippe, P., Magliacane, S., Löwe, S., Asano, Y. M., Cohen, T., and Gavves, E. (2022). Citris: Causal identifiability from temporal intervened sequences. arXiv preprint arXiv:2202.03169.
- Locatello et al. (2019) Locatello, F., Bauer, S., Lucic, M., Raetsch, G., Gelly, S., Schölkopf, B., and Bachem, O. (2019). Challenging common assumptions in the unsupervised learning of disentangled representations. In International Conference on Machine Learning, pages 4114–4124. PMLR.
- Locatello et al. (2020) Locatello, F., Poole, B., Rätsch, G., Schölkopf, B., Bachem, O., and Tschannen, M. (2020). Weakly-supervised disentanglement without compromises. In International Conference on Machine Learning, pages 6348–6359. PMLR.
- Mityagin (2015) Mityagin, B. (2015). The zero set of a real analytic function. arXiv preprint arXiv:1512.07276.
- Pearl (2009) Pearl, J. (2009). Causality. Cambridge university press.
- Peters et al. (2017) Peters, J., Janzing, D., and Schölkopf, B. (2017). Elements of causal inference: foundations and learning algorithms.
- Radford et al. (2021) Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., and Sutskever, I. (2021). Learning transferable visual models from natural language supervision. CoRR, abs/2103.00020.
- Schölkopf et al. (2021) Schölkopf, B., Locatello, F., Bauer, S., Ke, N. R., Kalchbrenner, N., Goyal, A., and Bengio, Y. (2021). Toward causal representation learning. Proceedings of the IEEE, 109(5):612–634.
- Schölkopf and von Kügelgen (2022) Schölkopf, B. and von Kügelgen, J. (2022). From statistical to causal learning. arXiv preprint arXiv:2204.00607.
- Shinners (2011) Shinners, P. (2011). Pygame. http://pygame.org/.
- Szegedy et al. (2016) Szegedy, C., Ioffe, S., and Vanhoucke, V. (2016). Inception-v4, inception-resnet and the impact of residual connections on learning. CoRR, abs/1602.07261.
- Von Kügelgen et al. (2021) Von Kügelgen, J., Sharma, Y., Gresele, L., Brendel, W., Schölkopf, B., Besserve, M., and Locatello, F. (2021). Self-supervised learning with data augmentations provably isolates content from style. Advances in Neural Information Processing Systems, 34.
- Yao et al. (2021) Yao, W., Sun, Y., Ho, A., Sun, C., and Zhang, K. (2021). Learning temporally causal latent processes from general temporal data. arXiv preprint arXiv:2110.05428.
- Zimmermann et al. (2021) Zimmermann, R. S., Sharma, Y., Schneider, S., Bethge, M., and Brendel, W. (2021). Contrastive learning inverts the data generating process. In International Conference on Machine Learning, pages 12979–12990. PMLR.
We organize the Appendix into three sections. In Section 7.1, we provide the proofs to all the propositions and the theorems. In Section 7.2, we discuss how some of the proposed results can be extended. In Section 7.3, we provide supplementary materials for the experiments.
We restate all the propositions and the theorems below for convenience. In the proofs that follow, we use () to denote the set of perturbations and the matrix of perturbations interchangeably (their usage is clear from the context). We start with the proof of Proposition 1, which follows the proof technique from Ahuja et al. (2022).
We simplify the identity in equation (3) as follows.
In the above simplification, we use the following observation. Since and are generated from and is injective, we can substitute and , where .
For simplicity denote the last line in above equation (6) as
We take the gradient of the function in the LHS and RHS of the above equation (7) separately w.r.t . Consider the component of denoted as . We first take the gradient of w.r.t below.
where , is the gradient of w.r.t and denotes the Jacobian of w.r.t . We simplify the above further to get
We can write the above for each component of as follows.
where is the Jacobian of computed at . We equate the gradient of LHS and RHS in (7) to obtain
Consider row of this identity. For each
where is the Hessian of and corresponds to the row of the Hessian matrix. Note that in the above expansion there is a different for each row (mean value theorem applied to each component of yields a different point on the line joining and ). From Assumption 3 it follows that over a set with non-zero measure. Since is analytic is also analytic (each component of the vector is a weighted sum of analytic functions). Therefore, we can conclude that for all (follows from Mityagin (2015)). We can make the same argument for each component and conclude that . From the identity in equation (3), it follows that for all and since the set is linearly independent for all . This implies .
We substitute this in equation (6) to get , where is the matrix of true perturbations and is the matrix of guessed perturbations (recall we stated above that we use as sets and matrices interchangeably). We now need to show that is invertible. Suppose was not invertible, which implies the rank of . Following Assumption 2, rank of is . Note that rank of is also . Note that if , where , , are three matrices, then . Following this identity, , which is a contradiction. Therefore, has to be invertible. This completes the proof.
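The rank argument above can be checked on a small numerical instance. The following is a minimal sketch, assuming the relation between the two perturbation matrices has the form `Delta_guess = A @ Delta` with perturbations stored as columns; the names `A_singular`, `A_invertible`, `Delta`, `rank`, and `matmul`, as well as the specific matrices, are illustrative choices and do not come from the paper. It shows that a rank-deficient matrix cannot map a full-rank set of true perturbations to a full-rank set of guesses.

```python
from fractions import Fraction


def rank(matrix):
    """Rank of a matrix (list of rows) via exact Gaussian elimination."""
    rows = [[Fraction(x) for x in row] for row in matrix]
    n_rows = len(rows)
    n_cols = len(rows[0]) if rows else 0
    r = 0  # number of pivots found so far
    for col in range(n_cols):
        pivot = next((i for i in range(r, n_rows) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(n_rows):
            if i != r and rows[i][col] != 0:
                factor = rows[i][col] / rows[r][col]
                rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        r += 1
        if r == n_rows:
            break
    return r


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


# Hypothetical setup: d = 3 latents, 4 true perturbations (columns) spanning R^3.
Delta = [[1, 0, 0, 1],
         [0, 1, 0, 1],
         [0, 0, 1, 1]]

# A rank-deficient candidate (second row is twice the first) ...
A_singular = [[1, 2, 3],
              [2, 4, 6],
              [0, 1, 1]]
# ... and an invertible candidate.
A_invertible = [[1, 2, 3],
                [0, 1, 4],
                [0, 0, 1]]

# rank(A @ Delta) <= min(rank(A), rank(Delta)), so a singular A
# cannot produce guessed perturbations of rank d.
print(rank(Delta))                        # 3
print(rank(matmul(A_singular, Delta)))    # 2
print(rank(matmul(A_invertible, Delta)))  # 3
```

Exact rational arithmetic is used here only to avoid tolerance choices in the rank computation; a floating-point linear algebra routine would serve the same purpose.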
If Assumptions 1-4 hold and the number of perturbations per example equals the latent dimension, , then the encoder that solves equation (3) (with as one-sparse and ) identifies true latents up to permutation and scaling, i.e. , where is an invertible diagonal matrix, is a permutation matrix and is an offset.
Since Assumptions 1, 2, and 3 hold, we can use Proposition 1 to obtain that any solution to equation (3) achieves affine identification guarantees w.r.t the true latents, i.e. , where , is the inverse image of (), is an invertible matrix and is the offset vector.
Define as the vector, which takes a value at component and everywhere else. Without loss of generality, the set of true perturbations is . Note that all ’s are non-zero as the span of has a dimension .
Denote the corresponding set of guesses from the agent as , where is a map used by the agent to guess the coordinate impacted by the perturbation. Note that since spans dimensions, has to be a bijection and all ’s are non-zero.
Take and the corresponding guess and substitute it in the relation to get
Since is a bijection, for every there is a unique in the RHS. From the above equation, we gather that the column of is . We apply this to all the columns and conclude that , where is a diagonal matrix and is a permutation matrix decided based on the bijection (, where is the column of the matrix). ∎
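The column-by-column argument in this proof can be traced on a small example. The sketch below assumes the affine relation from Proposition 1 reduces, for perturbations, to `A @ delta_true = delta_guess`; the dimension `d`, the magnitudes `c_true` and `c_guess`, the bijection `pi`, and the helper `one_sparse` are all hypothetical values chosen for illustration. Each column of `A` is read off from one pair of one-sparse perturbations, and the resulting matrix has exactly one non-zero entry per row and per column, which is the structure of a permutation matrix times a diagonal matrix.

```python
from fractions import Fraction

d = 3  # hypothetical latent dimension


def one_sparse(index, value, dim=d):
    """Vector with `value` at position `index` and zeros elsewhere."""
    vec = [Fraction(0)] * dim
    vec[index] = Fraction(value)
    return vec


# Hypothetical non-zero magnitudes of the true and guessed perturbations.
c_true = [Fraction(2), Fraction(-1), Fraction(1, 2)]
c_guess = [Fraction(1), Fraction(3), Fraction(-2)]
# Hypothetical bijection: true coordinate i is guessed as coordinate pi[i].
pi = [2, 0, 1]

true_perturbations = [one_sparse(i, c_true[i]) for i in range(d)]
guessed_perturbations = [one_sparse(pi[i], c_guess[i]) for i in range(d)]

# A @ (c_true[i] * e_i) = guess_i  implies  column i of A = guess_i / c_true[i].
columns = [[g / c_true[i] for g in guessed_perturbations[i]] for i in range(d)]
A = [[columns[j][i] for j in range(d)] for i in range(d)]  # A[i][j] = columns[j][i]

nonzeros_per_row = [sum(1 for x in row if x != 0) for row in A]
nonzeros_per_col = [sum(1 for x in col if x != 0) for col in columns]

for row in A:
    print([str(x) for x in row])
print(nonzeros_per_row, nonzeros_per_col)  # [1, 1, 1] [1, 1, 1]
```

Because `pi` is a bijection, no two columns place their non-zero entry in the same row, so `A` is invertible whenever every magnitude is non-zero.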
Since Assumptions 1, 2, and 3 hold, we can use Proposition 1 to obtain that any solution to equation (3) achieves affine identification guarantees w.r.t the true latents, i.e. , where , is the inverse image of (), is an invertible matrix and is the offset vector.
We start the proof by assuming that the agent knows the blocks that are impacted under each perturbation, i.e., for each , the agent knows the block of the latents that are impacted denoted as . We relax this assumption later.
Following Assumption 5, we know that perturbations are -sparse, blockwise and non-overlapping. Without loss of generality, we can assume that the different groups on which perturbations in act are given as , and so on. Consider a perturbation , which belongs to Group and impacts the latents . For this perturbation, the agent selects , which shares the same sparsity pattern. Therefore, the first elements of and are both non-zero and the rest of the elements are zero. Under these assumptions, we can write the relationship between true and guessed perturbations as follows.
Denote the first elements of row of matrix as and the first elements of the vector as . For , we use the equation (14) to get .
Let denote the number of perturbations in Group , where . For all we can solve for the first block using the perturbations guessed by the agent and the true perturbations in Group . Denote the first block of as and the first components of the perturbations in Group as . Similarly, the first components of the perturbations guessed by the learner are denoted as . We now need to show that the block is invertible. From the above equation in (14), we get
where is the number of perturbations in Group .
Since the rank of and is , the rank of cannot be less than or else it would lead to a contradiction. This shows that is invertible. We derived the properties of the first columns of matrix . For Group 2, we similarly obtain that is an invertible matrix and the rest of the elements in columns are zero. Due to the symmetry of the setting, we can apply the same argument to all the other blocks as well. Therefore, we conclude that is block-diagonal and invertible. This leads to the conclusion that , where and .
So far we assumed that the agent knows how the interventions in impact the blocks . Under Assumption 6, the agent knows the groups of the perturbations only. For perturbations in Group that impact , the agent guesses . Note that perturbations in impact the same block of length with indices given as . Recall the first elements of row of matrix and vector are denoted as and respectively. There exist rows in for which we get . Thus for all these rows. The first elements of remaining rows form a square matrix denoted as , where are the indices guessed by the agent for the block corresponding to Group 1. satisfies
where is the matrix of non-zero components of the perturbation vectors that the agent guesses. Using the same argument as above, we can argue that is invertible. We have derived the properties of first columns of . We apply the same argument to other groups as well. Since the agent selects a set of unique indices for each group, we obtain that the matrix can be factorized as a permutation matrix times a block diagonal matrix. The first rows of the permutation matrix with index have ones at locations and so on. As a result, we get that .
This completes the proof. ∎
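The block structure derived in this proof can be reproduced numerically. The sketch below is an illustration under stated assumptions: the perturbation relation is taken to be `A @ Delta = Delta_guess` with perturbations stored as columns, there are four latents split into two blocks of size two, and the matrices `Delta`, `Delta_guess`, and `Delta_guess_swapped` as well as the helpers `inverse` and `matmul` are hypothetical. Solving for `A` without imposing any structure yields a block-diagonal matrix when the guesses share the true sparsity pattern, and a row-permuted block-diagonal matrix when the agent assigns each group to a different set of indices.

```python
from fractions import Fraction


def inverse(m):
    """Exact inverse via Gauss-Jordan elimination (assumes m is invertible)."""
    n = len(m)
    aug = [[Fraction(x) for x in row] + [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(m)]
    for col in range(n):
        pivot = next(i for i in range(col, n) if aug[i][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [x / p for x in aug[col]]
        for i in range(n):
            if i != col and aug[i][col] != 0:
                f = aug[i][col]
                aug[i] = [a - f * b for a, b in zip(aug[i], aug[col])]
    return [row[n:] for row in aug]


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


# True perturbations: columns 0-1 act on block {0, 1}, columns 2-3 on block {2, 3}.
Delta = [[1, 1, 0, 0],
         [0, 2, 0, 0],
         [0, 0, 2, 1],
         [0, 0, 1, 1]]

# Guesses with the same sparsity pattern as the true perturbations.
Delta_guess = [[1, 0, 0, 0],
               [0, 1, 0, 0],
               [0, 0, 0, 1],
               [0, 0, 3, 0]]

# Guesses where the agent only knows the groups and picks the other index set.
Delta_guess_swapped = [[0, 0, 0, 1],
                       [0, 0, 3, 0],
                       [1, 0, 0, 0],
                       [0, 1, 0, 0]]

# A @ Delta = Delta_guess  implies  A = Delta_guess @ inverse(Delta).
Delta_inv = inverse(Delta)
A = matmul(Delta_guess, Delta_inv)
A_swapped = matmul(Delta_guess_swapped, Delta_inv)

for name, mat in (("A", A), ("A_swapped", A_swapped)):
    print(name)
    for row in mat:
        print([str(x) for x in row])
```

In this instance `A` has zeros outside its two diagonal blocks, and `A_swapped` equals `A` with its two row blocks exchanged, which matches the factorization into a permutation matrix times a block-diagonal matrix.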
Following Assumption 7, we know that there exist at least two subsets and that satisfy blockwise non-overlapping perturbations. As in the previous proof, we start with the case where the agent knows the exact sparsity pattern in the perturbations. We relax this assumption later. Consider a block impacted by the perturbations in . Since is blockwise and non-overlapping, we can follow the analysis in the first part of the previous theorem to get is an invertible affine transform of . Hence, the latents in each of the blocks are identified up to an affine transform. Similarly, each block is identified up to an affine transform. Consider an element . can be expressed as an affine transform of two different blocks of latents and . and share some components, which we denote as . The components exclusive to () are denoted as ().
We write this condition as follows.
If is non-zero, i.e., at least one element is non-zero, then the range of LHS is . But the range of the RHS is a constant. Therefore, for the above to be true and that implies . As a result, the linear entanglement is now confined to only the intersecting variables . We can repeat this argument for all elements in .
In the proof above, we relied on the assumption that the components impacted by each intervention are known. We now relax this assumption and work with the assumption that was used in the previous theorem (Assumption 6), which states that the agent knows the group label of each perturbation.
Consider the latents in the block , which we denote as . We apply Theorem 2 to this block. Let the set of estimated latents that affine identify be , where is the set of indices in . We write this as