# A theory of independent mechanisms for extrapolation in generative models

Deep generative models reproduce complex empirical data but cannot extrapolate to novel environments. An intuitive idea to promote extrapolation capabilities is to constrain the architecture to have the modular structure of a causal graphical model, where one can intervene on each module independently of the others in the graph. We develop a framework to formalize this intuition, using the principle of Independent Causal Mechanisms, and show how over-parameterization of generative neural networks can hinder extrapolation capabilities. Our experiments on the generation of human faces show that successive layers of a generator architecture implement independent mechanisms to some extent, allowing meaningful extrapolations. Finally, we illustrate that independence of mechanisms may be enforced during training to improve extrapolation.

### 1 Introduction

Deep generative models such as Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) and Variational Autoencoders (VAEs) (Kingma & Welling, 2013; Rezende et al., 2014) are able to learn complex structured data such as natural images. However, once such a network has been trained on a particular dataset, can it be leveraged to simulate distributions with meaningful differences?

A causal model allows the different mechanisms involved in generating the data to be intervened on independently, based on the principle of Independent Causal Mechanisms (Janzing & Schölkopf, 2010; Lemeire & Janzing, 2012; Peters et al., 2017). Having the internal computations performed by a multi-layer generative model reflect the causal structure of the data generating mechanism would thus endow it with a form of modularity, such that particular transformations performed at intermediate layers cause meaningful and predictable changes in the output distribution. We call extrapolation the ability to predict such changes, following the intuitive idea that it involves generalizing beyond the support of the distribution sampled during training.

In this paper, we elaborate a general framework to assess the extrapolation capabilities of generative models based on their modularity. We show how non-identifiability due to over-parameterization of the model can hinder modularity and entangle the mechanisms implemented by successive layers of the network. We use spectral independence (Shajarisales et al., 2015) to quantify this entanglement between successive layers of a neural network. Experiments show that VAEs trained on the CelebA face dataset are to some extent disentangled, allowing geometric transformations of their intermediate activation maps to alter specific aspects of their output. Finally, we show how optimizing spectral independence can improve such desirable properties. Readers can refer to the appendices for a list of supplementary files and code resources, to the table of symbols and acronyms below, and to App. A and App. B for all proofs and method details, respectively.

Related work.

Deep neural networks have been leveraged in causal inference for learning causal graphs between observed variables (Lopez-Paz & Oquab, 2016) and associated causal effects (Louizos et al., 2017; Shalit et al., 2017; Kocaoglu et al., 2017). Our goal is more akin to disentangling unobserved independent causal mechanisms, and leverages the internal causal structure of generative models to do so. This is based on group invariance principles akin to Besserve et al. (2018), and relates to generating counterfactuals in deep generative models (Besserve et al., 2020; Bau et al., 2018). The notion of extrapolation we investigate has also been studied in the context of dynamical systems (Martius & Lampert, 2016). Finally, our investigation of over-parameterization relates to studies of ReLU networks (Neyshabur et al., 2017). This latter work and Zhang et al. (2016) argue that Stochastic Gradient Descent (SGD) implements an implicit regularization beneficial to supervised learning, while we provide a different perspective in the unsupervised setting.

Symbols and acronyms.

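| Abbrev./Symbol | Name | Eq. |
|---|---|---|
| NF | Normalizing Flow | |
| $\circ$ | function composition | 1 |
| $\circledast$ | circular convolution | 4 |
| $\theta^*$ | true parameters | |
| $S_{\theta^*}$ | solution set | 3 |
| COS / $S^\Omega_{\theta^*}$ | Composed Over-parametrization Set | 5 |
| $M^G_{\theta^*}$ | extrapolated class | 6 |
| $\varphi$ | generic ratio | 7 |
| TR | Trace Ratio | 11 |
| SDR / $\rho$ | Spectral Density Ratio | 9 |
| $\odot$ | entrywise product | 4 |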

### 2 Extrapolation in generative models

#### 2.1 FluoHair: A paradigmatic extrapolation in VAEs

Let us first illustrate what we mean by extrapolation, and its relevance for deep generative models, with a straightforward transformation: color change. “Fluorescent” hair colors are at least very infrequent in the CelebFaces Attributes Dataset (CelebA), so generative models trained on it are very unlikely to generate samples with such an attribute. Given that such hair styles do exist, they represent a plausible modification of the distribution learned from the training set, which we may wish to implement to endow a model with extrapolation capabilities. Fig. 1 shows how, after identifying channels representing hair in the second to last layer of a trained VAE (based on the approach of Besserve et al. (2020)), the convolution kernel of the last layer can be modified to generate faces with various types of fluorescence (see App. B). Importantly, the “shape” of the haircut, controlled by parameters in the layers above, remains the same, illustrating layer-wise modularity of the network. Notably, this also shows the unsupervised aspect of our approach to extrapolation: no labeling or selection of data is exploited. Such a transformation of an element of the computational graph of the generative model will guide our framework. While this color editing example is straightforward, theory and applications will be illustrated on a richer class of interventions on the shape of visual features encoded at multiple spatial scales by the hierarchy of convolutional layers (Model 1).

#### 2.2 Neural networks as structural causal models

We consider multi-layer generative models consisting of successively composed functions $f_k^{\theta_k}$, $k = 1, \dots, K$, parameterized by $\theta_k$, applied to a latent variable $Z$ with a fixed distribution, such that the resulting distribution of

$$X = f_K^{\theta_K} \circ \cdots \circ f_1^{\theta_1}(Z) \tag{1}$$

fits that of the observed data. We denote by $D_\theta$ the distribution of $X$ for any $\theta = (\theta_1, \dots, \theta_K)$ in $T$, and call $\{D_\theta\}_{\theta \in T}$ the model class. In this theoretical section, we assume the observed data has been generated by the model for the vector of so-called true parameters $\theta^*$, such that its distribution is $D_{\theta^*}$. We will call normalizing flow (NF) models those in which all $f_k^{\theta_k}$ are invertible for all possible choices of parameters (following Rezende & Mohamed (2015)).

We assume the optimization procedure fits the data distribution perfectly by choosing a vector of parameters $\tilde\theta$. We thus define the solution set

$$S_{\theta^*} = \{\theta \in T \mid D_\theta = D_{\theta^*}\},$$

such that $\tilde\theta \in S_{\theta^*}$. If $S_{\theta^*} = \{\theta^*\}$, a causal interpretation of the model is possible, such that each function $f_k^{\tilde\theta_k}$ represents a specific causal mechanism in the true generative process of the observed data. We call this case structural identifiability, and one can then interpret eq. (1) as a structural causal model (Pearl, 2000) for the chain $Z \to V_1 \to \cdots \to V_{K-1} \to X$, where $V_k$ is the random variable at the output of $f_k^{\theta_k}$. The ICM principle at the heart of causal reasoning then allows extrapolation to other plausible distributions of the output by intervening on one function $f_k$ while the other functions are kept fixed. In contrast, if $S_{\theta^*}$ is non-singleton and a $\tilde\theta \neq \theta^*$ is chosen by the learning algorithm, extrapolation is, in general, not guaranteed to behave like the true solution.

#### 2.3 Two layer model and first example

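To simplify notation and without loss of generality, we will focus on fitting a cascade of two functions

$$X = f_2^{\theta_2}(V) = f_2^{\theta_2}\big(f_1^{\theta_1}(Z)\big). \tag{2}$$

Let $F_1$ and $F_2$ be the two parametric families of functions, with $V$ the intermediate variable at the output of the first one. We assume that the mappings $\theta_k \mapsto f_k^{\theta_k}$ are bijective, so that, with a slight abuse of notation, we can denote parameter values by their corresponding functions. Thus $S_{\theta^*}$ is viewed as a set of function pairs:

$$S_{\theta^*} = \{(f_1, f_2) \in F_1 \times F_2 \mid D_{(f_1, f_2)} = D_{\theta^*}\}. \tag{3}$$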

Our framework will be illustrated on the following model.

###### Model 1 (Linear 2-layer convNet).

Assume $d$ is a prime number (this allows enforcing simple group structures, e.g. Prop. 2) and $Z$ is a $d \times d$ random binary latent image, such that one single pixel is set to one at each realization, with the location of that pixel drawn uniformly at random over the image. Let $k_1$, $k_2$ be two invertible convolution kernels, and

$$X = k_2 \circledast V = k_2 \circledast k_1 \circledast Z, \tag{4}$$

where $\circledast$ indicates the circular convolution (modulo $d$).

The reader can refer to App. B.2 for background on circular convolution. Such a model can be used to put several copies of the same object in a particular spatial configuration at a random position in an image. The following example (Fig. 2) is an “eye generator” that puts an eye shape at two locations separated horizontally by a fixed distance, to model the eyes of a human face. The location of this “eye pair” in the whole image is uniformly distributed.

###### Example 1 (Eye generator, Fig. 2).

Consider Model 1 with $k_2$ a convolution kernel taking non-zero values only within a minimal square of given side length, encoding the eye shape, and $k_1$ a kernel with only two non-vanishing pixels, encoding the relative position of each eye.
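As a concrete illustration of Model 1 and Example 1, the following NumPy sketch samples one image from an "eye generator". The image side `d = 17`, the kernel values and the helper names `circ_conv2d` and `sample_latent` are illustrative assumptions rather than values from the paper; the circular convolution is computed through the FFT.

```python
import numpy as np

def circ_conv2d(a, b):
    """Circular (modulo d) 2D convolution of two d x d arrays, via the FFT."""
    return np.real(np.fft.ifft2(np.fft.fft2(a) * np.fft.fft2(b)))

def sample_latent(rng, d):
    """Binary latent image Z: a single pixel set to one, at a uniform location."""
    z = np.zeros((d, d))
    i, j = rng.integers(0, d, size=2)
    z[i, j] = 1.0
    return z

d = 17                                   # prime image side (assumed value)
rng = np.random.default_rng(0)

k2 = np.zeros((d, d))                    # hypothetical 3x3 "eye shape"
k2[:3, :3] = [[0, 1, 0], [1, 2, 1], [0, 1, 0]]

k1 = np.zeros((d, d))                    # two pixels: relative eye positions
k1[0, 0] = 1.0
k1[0, 6] = 1.0                           # horizontal distance of 6 pixels

z = sample_latent(rng, d)
v = circ_conv2d(k1, z)                   # intermediate variable V
x = circ_conv2d(k2, v)                   # generated image X: two copies of the eye
```

With these choices the two eye copies never overlap, whatever the sampled location, because the circular shift only translates the pair as a whole.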

#### 2.4 Characterization of the solution set

We can readily see that Model 1 admits “trivial” alternatives to the true parameters that fit the data perfectly, obtained simply by left-composing arbitrary rescalings and translations with $k_1$, and right-composing the inverse transformation with $k_2$. This is in line with observations by Neyshabur et al. (2017) on over-parameterization of ReLU networks. Indeed, the ReLU activation function commutes with positive linear rescalings, such that $\mathrm{ReLU}(\alpha x) = \alpha\,\mathrm{ReLU}(x)$ for any $\alpha > 0$. Incoming synaptic weights of a unit can thus be downscaled while upscaling all outgoing synaptic weights without changing the input-output mapping of the network.
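The rescaling invariance of ReLU networks can be checked numerically. The sketch below uses a small two-layer network with random weights; the layer sizes, the names `W1`, `W2` and the per-unit factors `alpha` are arbitrary choices made for illustration.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

rng = np.random.default_rng(0)
W1 = rng.normal(size=(5, 3))             # incoming weights of 5 hidden units
W2 = rng.normal(size=(2, 5))             # outgoing weights of these units
x = rng.normal(size=3)
alpha = rng.uniform(0.5, 2.0, size=5)    # one positive factor per hidden unit

y = W2 @ relu(W1 @ x)

# Downscale the incoming weights of unit j by alpha[j] (rows of W1) and
# upscale its outgoing weights by the same factor (columns of W2).
y_rescaled = (W2 * alpha) @ relu((W1 / alpha[:, None]) @ x)
```

Both parameter settings implement the same input-output mapping, so the data cannot distinguish between them.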

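From these examples, we derive a general analysis of the over-parameterization entailed by composing two functions. Let $\Omega$ be the subset of right-invertible functions $\omega$ on the range of $V$ such that, for any $(f_1, f_2) \in F_1 \times F_2$, $\omega^{-1} \circ f_1 \in F_1$ and $f_2 \circ \omega \in F_2$ (using the convention $\omega \circ \omega^{-1} = \mathrm{id}$). Trivially, $\Omega$ contains at least the identity map. For any true parameter $\theta^*$, we define the Composed Over-parametrization Set (COS)

$$S^{\Omega}_{\theta^*} = \left\{\left(\omega^{-1} \circ f_1^{\theta_1^*},\ f_2^{\theta_2^*} \circ \omega\right) \,\middle|\, \omega \in \Omega\right\}. \tag{5}$$

The COS reflects how “internal” operations on $V$ make the optimization problem under-determined, because they can be compensated by internal operations in neighboring layers. By definition, the COS is a subset of the solution set $S_{\theta^*}$, and inclusion turns into equality in NF cases.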

###### Proposition 1.

For an NF model, $\Omega$ is a group and, for any true parameter $\theta^*$, $S_{\theta^*} = S^{\Omega}_{\theta^*}$.

This has a direct consequence for Model 1.

###### Corollary 1.

Model 1 is NF, such that $S_{\theta^*} = S^{\Omega}_{\theta^*}$, with $\Omega$ the set of invertible convolution kernels.
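To make the COS of Model 1 tangible, the sketch below builds one alternative solution $(\omega^{-1} \circledast k_1,\ k_2 \circledast \omega)$ and checks that it generates the same image. The kernels are random and the particular invertible kernel `omega` is a hypothetical choice; convolutions and the inverse are computed in the Fourier domain.

```python
import numpy as np

F, iF = np.fft.fft2, np.fft.ifft2
d = 7
rng = np.random.default_rng(1)
k1 = rng.normal(size=(d, d))
k2 = rng.normal(size=(d, d))

# omega = delta + 0.5 * shifted delta: its DFT has modulus >= 0.5,
# so it is an invertible convolution kernel.
omega = np.zeros((d, d))
omega[0, 0], omega[0, 1] = 1.0, 0.5
omega_hat = F(omega)

k1_alt = np.real(iF(F(k1) / omega_hat))   # omega^{-1} convolved with k1
k2_alt = np.real(iF(F(k2) * omega_hat))   # k2 convolved with omega

z = np.zeros((d, d))
z[2, 5] = 1.0                             # one realization of the latent image
x = np.real(iF(F(k2) * F(k1) * F(z)))
x_alt = np.real(iF(F(k2_alt) * F(k1_alt) * F(z)))
```

The pair `(k1_alt, k2_alt)` differs from `(k1, k2)` but yields the same output for every latent image, which is exactly the non-identifiability captured by the COS.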

We will exploit the COS structure to study the link between identifiability and extrapolation, which we define next.

#### 2.5 Extrapolated class of distributions

Humans can generalize from observed data by envisioning objects that were not previously observed, such as a pink elephant, akin to our FluoHair example (Fig. 1). To investigate the notion of extrapolation rigorously, we introduce a class of distributions that we coin the extrapolated class, which contains the true distribution $D_{\theta^*}$ as well as distributions generated by modifications of the true generative model. In line with the pink elephant allegory, we assume this class results from manipulating an abstract/internal representation (in this case encoding the concept of skin color), instantiated in our case by the vector $V$, through transformations of $f_1^*$ taken (typically) from a group $G$ (background on the group theoretic concepts used in our analysis is provided in Appendix B.1):

$$M^G_{(f_1^*, f_2^*)} = M^G_{\theta^*} \triangleq \left\{D_{(g \cdot f_1^*,\, f_2^*)},\ g \in G\right\}, \tag{6}$$

where $g \cdot f_1^*$ denotes the group action of $g$ on $f_1^*$, transforming it into another element of $F_1$. $M^G_{\theta^*}$ thus encodes the inductive bias used to extrapolate from one learned model to others (when $\theta^*$ is unambiguous, it is denoted $M^G$). For simplicity $G$ acts on $f_1$; acting instead on $f_2$, or on both, can be handled in a similar way. The principle of the extrapolated class (assuming a smooth manifold structure for all sets) is illustrated in Suppl. Fig. 7. For Model 1, a possible group is the set of all rescalings by a non-zero integer.

###### Proposition 2 (Scaling group).

The set of multiplications by non-zero integers modulo $d$ is a group, and the corresponding rescaling of the pixel coordinates of a kernel is a group action.

Such transformations can model classical feature variations in naturalistic images: as an illustration in Example 1, using this group action leads to an extrapolated class that comprises models with various distances between the eyes, corresponding to a likely variation of human face properties. See Fig. 2, top row for an example extrapolation.
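A minimal sketch of such a rescaling is given below. The exact convention of the paper's group action is not reproduced here; we assume, for illustration, that the integer `g` moves the value at pixel $(i, j)$ to pixel $(g\,i, g\,j)$ modulo $d$, and the helper name `scale_kernel` is hypothetical.

```python
import numpy as np

def scale_kernel(k, g):
    """Move the value at pixel (i, j) to pixel (g*i mod d, g*j mod d)."""
    d = k.shape[0]
    if g % d == 0:
        raise ValueError("g must be non-zero modulo d")
    idx = (g * np.arange(d)) % d      # a permutation of 0..d-1 when d is prime
    out = np.empty_like(k)
    out[np.ix_(idx, idx)] = k
    return out

d = 17
k1 = np.zeros((d, d))
k1[0, 0] = k1[0, 6] = 1.0             # two "eye positions", 6 pixels apart
k1_wide = scale_kernel(k1, 2)          # same two pixels, now 12 pixels apart
```

Because $d$ is prime, every non-zero `g` has an inverse modulo $d$ (here $2 \cdot 9 \equiv 1 \bmod 17$), so the rescaling can always be undone, and composing two rescalings amounts to rescaling by the product.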

#### 2.6 Equivalence of solutions for extrapolation

As studied in Sec. 2.4, the solution set $S_{\theta^*}$ may not be a singleton, so the solution $(\tilde f_1, \tilde f_2)$ found by the learning algorithm may not be the true pair $(f_1^*, f_2^*)$, leading to a different extrapolated class. When the classes happen to be the same, we say the solution is $G$-equivalent.

###### Definition 1 (G-equivalence).

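The solution $(\tilde f_1, \tilde f_2)$ is $G$-equivalent to the true $(f_1^*, f_2^*)$ if $M^G_{(\tilde f_1, \tilde f_2)} = M^G_{(f_1^*, f_2^*)}$.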

An illustration of $G$-equivalence violation for Example 1 is shown in Fig. 2, and an additional representation of the phenomenon is given in Suppl. Fig. 7. Equivalence for extrapolation imposes additional requirements on solutions.

###### Proposition 3.

Let be the group of circular Fourier coefficients permutations, a solution for Model 1, given true model , is -equivalent requires there is a such that .

This implies that $G$-equivalence is achieved only for solutions that are similar to the true parameters (up to trivial rescaling), and is thus only slightly weaker than identifiability.

#### 2.7 G-genericity as weakened G-equivalence

As we do not have a computationally tractable method to assess $G$-equivalence in practice, we resort to characterizing invariant properties of $M^G$ to select solutions. Indeed, if $M^G$ is a set that “generalizes” the true model distribution $D_{\theta^*}$, it should be possible to express the fact that some property of $D_{\theta^*}$ is generic in $M^G$. Let $\tilde\varphi$ be a contrast function that captures approximately the relevant property of $D_{\theta^*}$; we check that this function does not change on average when applying random transformations from $G$. For $G$ a compact group, it is natural to sample group elements from the Haar measure $\mu_G$ of the group ($\mu_G$ is a “uniform” distribution on $G$, see App. B), leading to the following definition.

###### Definition 2 (Contrast based genericity).

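Let $\tilde\varphi$ be a real-valued function of distributions of $X$, and $G$ a compact group. For any solution $(\tilde f_1, \tilde f_2)$ of the model fit procedure, we define the generic ratio

$$\varphi(\tilde f_1, \tilde f_2) = \varphi(\tilde f_1(Z), \tilde f_2) \triangleq \frac{\tilde\varphi\big(D_{(\tilde f_1, \tilde f_2)}\big)}{\mathbb{E}_{g \sim \mu_G}\, \tilde\varphi\big(D_{(g \cdot \tilde f_1, \tilde f_2)}\big)} \tag{7}$$

and say the solution $(\tilde f_1, \tilde f_2)$ is (approximately) $G$-generic w.r.t. $\tilde\varphi$ whenever it satisfies (approximately) $\varphi(\tilde f_1, \tilde f_2) = 1$.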

It then follows naturally from the definition that $G$-equivalence entails a form of $G$-genericity.

###### Proposition 4.

If $\tilde\varphi$ is constant on $M^G_{\theta^*}$, then the true solution $\theta^*$ is $G$-generic w.r.t. $\tilde\varphi$, and so is any $G$-equivalent solution.

Interestingly, while we introduce genericity to assess extrapolation, it was defined by Besserve et al. (2018) as a measure of independence between the cause $V = f_1(Z)$ and the mechanism $f_2$. We use both the “parametric” notation $\varphi(f_1, f_2)$ and the “cause-effect” notation $\varphi(V, f_2)$.

In our context, independence reflects the assumption that the properties of $V$ are modulated by factors (pertaining to $f_1$) that have nothing to do with the ones affecting the mechanism $f_2$ that computes $X$ from $V$. In the above definition of $\varphi$, these modulating factors are modeled by applying group elements $g$ selected at random, which turn $f_1$ into a perturbed version $g \cdot f_1$.

#### 2.8 Scale and spectral independence

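In the case of Model 1, one possible contrast is the total power $P(X)$ of the output $X$. Indeed, it is already clear that power will not change after rescaling the distance between eyes in Example 1, provided we make sure both eyes do not overlap. This view has a frequency domain interpretation: as circular convolution turns into an elementwise product in the Fourier domain (see App. B.3), and the Discrete Fourier Transform (DFT) of $Z$ has modulus one at all frequencies, Parseval's theorem yields

$$P(X) = \left\langle \mathbb{E}\,|\hat{k}_2 \odot \hat{V}|^2 \right\rangle = \left\langle |\hat{k}_2 \odot \hat{k}_1|^2 \right\rangle, \tag{8}$$

where $\hat{k}$ is the DFT of $k$, $\langle \cdot \rangle$ denotes averaging over 2D frequencies and $\odot$ is the entrywise product. Another way to intervene on the scales is thus to apply elements from the group $G$ of circular permutations to the frequencies of $\hat{k}_1$. We obtain the $G$-generic ratio for Model 1,

$$\rho(V, k_2) = \frac{\left\langle \mathbb{E}\,|\hat{k}_2 \odot \hat{V}|^2 \right\rangle}{\left\langle \mathbb{E}\,|\hat{V}|^2 \right\rangle \left\langle |\hat{k}_2|^2 \right\rangle} = \frac{\left\langle |\hat{k}_2 \odot \hat{k}_1|^2 \right\rangle}{\left\langle |\hat{k}_1|^2 \right\rangle \left\langle |\hat{k}_2|^2 \right\rangle}, \tag{9}$$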

which we call Spectral Density Ratio (SDR), as it appears as a discrete frequency version of the quantity introduced by Shajarisales et al. (2015). This leads to the following result

###### Proposition 5.

For Example 1, if , then , for preserving this previous inequality. Moreover, the true solution of Example 1 is -generic with respect to .

In line with Shajarisales et al. (2015), we say that such a $G$-generic solution w.r.t. the power contrast satisfies spectral independence. This supports the use of the SDR to check whether successive convolution layers implement mechanisms at independent scales.
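The SDR of eq. (9) is straightforward to evaluate numerically. The sketch below computes it for a hypothetical eye generator (3x3 eye shape, two eye positions 6 pixels apart, $d = 17$) and for an alternative solution taken from the COS of Sec. 2.4; the kernels, the kernel `omega` and the function name `sdr` are assumptions made for illustration.

```python
import numpy as np

def sdr(k1, k2):
    """Spectral Density Ratio of eq. (9), kernel form (right-hand side)."""
    p1 = np.abs(np.fft.fft2(k1)) ** 2
    p2 = np.abs(np.fft.fft2(k2)) ** 2
    return np.mean(p1 * p2) / (np.mean(p1) * np.mean(p2))

d = 17
k2 = np.zeros((d, d))
k2[:3, :3] = [[0, 1, 0], [1, 2, 1], [0, 1, 0]]   # hypothetical eye shape
k1 = np.zeros((d, d))
k1[0, 0] = k1[0, 6] = 1.0                         # two non-overlapping eye positions

rho_true = sdr(k1, k2)

# Alternative solution from the COS: (omega^{-1} * k1, k2 * omega)
omega = np.zeros((d, d))
omega[0, 0], omega[0, 1] = 1.0, 0.5               # invertible: |DFT| >= 0.5
omega_hat = np.fft.fft2(omega)
k1_alt = np.real(np.fft.ifft2(np.fft.fft2(k1) / omega_hat))
k2_alt = np.real(np.fft.ifft2(np.fft.fft2(k2) * omega_hat))
rho_alt = sdr(k1_alt, k2_alt)
```

Under these assumptions `rho_true` equals 1 up to rounding, since the power of two non-overlapping eyes is twice the power of one eye, whereas `rho_alt` departs clearly from 1 although both kernel pairs generate the same images.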

#### 2.9 Causal versus anti-causal models

An interesting application of genericity is identifying the direction of causation: in several settings, if $\varphi(V, f_2) = 1$ for the causal direction $V \to X$, then the anti-causal direction $X \to V$ is not generic, as $\varphi(X, f_2^{-1}) \neq 1$. We can use spectral independence to check a causal/anti-causal interpretation of the decoder/encoder architectures of VAEs.

### 3 How learning influences extrapolation

When models are over-parameterized, the learning algorithm likely affects the choice of parameters, and thus the extrapolation properties introduced above. We will rely on a simplification of Model 1 that allows us to study the mechanisms at play without the heavier formalism of convolution operations.

#### 3.1 Diagonal model

###### Model 2.

Consider the linear generative model

 X=ABZ (10)

with , square positive definite diagonal matrices and a vector of positive independent random variables such that .

Model 2 can be seen as a Fourier domain version of Model 1, with some technicalities dropped (e.g. we use real positive numbers instead of complex numbers), and we get results analogous to those for Model 1 regarding the solution set and $G$-equivalence (see Prop. 2 and Corol. 10 in App. B.4).

Because its structure apparently differs from Model 1, a different form of genericity is expected for this new model. However, as elaborated by Shajarisales et al. (2015), the Trace Method (Janzing et al., 2010) that quantifies genericity with respect to the group of orthogonal transformations is the Fourier domain equivalent of SDR analysis. In the context of Model 2, it relies on the energy contrast

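$$\tilde\varphi(B, A) = \tau\left[ABB^\top A^\top\right] = \frac{1}{d}\sum_{i=1}^{d} a_i^2 b_i^2,$$

where $a_i$ and $b_i$ are the diagonal entries of $A$ and $B$, and $\tau$ is the normalized trace, $\tau[M] = \frac{1}{d}\mathrm{Tr}[M]$. We consider the smaller (cyclic) group $G$ of circular permutations of the matrix coordinates. This leads to the trace ratio

$$\mathrm{TR}(B, A) \triangleq \frac{\tau\left[ABB^\top A^\top\right]}{\tau\left[AA^\top\right]\,\tau\left[BB^\top\right]}. \tag{11}$$

###### Proposition 6.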

In Model 2, TR is a generic ratio for $G$-genericity w.r.t. $\tilde\varphi$.

A way to illustrate how this generic ratio entails a form of independence (akin to spectral independence) is to assume that all diagonal coefficients of each matrix are chosen by Nature as independent realizations of two independent random variables $a$ and $b$, respectively. We then get asymptotic genericity,

$$\mathrm{TR}(B, A) \xrightarrow[d \to +\infty]{} \frac{\mathbb{E}[a^2 b^2]}{\mathbb{E}[a^2]\,\mathbb{E}[b^2]} = 1,$$

that is, the true parameters that generated the observation exhibit genericity, due to our ICM-compatible choice. It is easy to see from this formula that TR quantifies the correlation between the squared diagonal entries of the two matrices, with two forms of dependency: for $\mathrm{TR} > 1$, squared diagonal entries are positively correlated, while for $\mathrm{TR} < 1$, they are negatively correlated. This interpretation of TR generalizes to the case of non-diagonal matrices if $A^\top A$ and $BB^\top$ share the same eigenvector basis (correlation is then quantified between the respective eigenvalues of these matrices). TR can also be interpreted as measuring freeness in free probability theory (Zscheischler et al., 2011).
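The three regimes of TR can be illustrated with a short simulation. The uniform distribution of the diagonal entries, the dimension and the function name `trace_ratio` are arbitrary choices made for this sketch.

```python
import numpy as np

def trace_ratio(a, b):
    """TR of eq. (11) for diagonal matrices A = diag(a) and B = diag(b)."""
    a2, b2 = a ** 2, b ** 2
    return np.mean(a2 * b2) / (np.mean(a2) * np.mean(b2))

rng = np.random.default_rng(0)
d = 200_000
a = rng.uniform(0.5, 1.5, size=d)
b = rng.uniform(0.5, 1.5, size=d)         # drawn independently of a

tr_independent = trace_ratio(a, b)        # close to 1 for large d
tr_positive = trace_ratio(a, a)           # b = a: positively correlated, TR > 1
tr_negative = trace_ratio(a, 2.0 - a)     # b decreasing in a: TR < 1
```

Only the independently drawn pair is approximately generic; tying the entries of the two matrices together in either direction moves TR away from 1.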

#### 3.2 Drift of over-parameterized solutions

Consider Model 2 in the 1D case. We consider a VAE-like training: conditional on the latent variable $z$, the observed data is assumed Gaussian with fixed variance $\sigma^2$ and mean given by the generator’s output $abz$. To simplify the theoretical analysis, we study only the decoder of the VAE, and thus assume a fixed latent value $z = 1$ (i.e. the encoder part of the VAE infers a Dirac for the posterior of $Z$ given the data). Assuming the true model has parameters $(a^*, b^*)$ with $c_0 = a^* b^*$, we thus use data sampled from $N(c_0, \sigma^2)$, and learn $(a, b)$ from it.

First, considering infinite amounts of data, the maximum likelihood framework leads to the least squares objective

$$\underset{a, b > 0}{\text{minimize}}\quad L(c; (a, b)) = |c - ab|^2. \tag{12}$$

We study the solution of the deterministic continuous time gradient descent (CTGD, see proof of Prop. 7 in App. A for the exact meaning) for this objective.

###### Proposition 7.

Consider the CTGD of problem (12). From any initial point $(a_0, b_0)$, the trajectory leaves the quantity $a^2 - b^2$ unchanged and converges to the intersection point of the corresponding hyperbola with the solution set $\{(a, b) : ab = c\}$.

Typical trajectories are represented in red on Fig. 3.
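The conservation property can be verified with a short computation, assuming that CTGD denotes the gradient flow $\dot a = -\partial_a L$, $\dot b = -\partial_b L$ of objective (12) (the exact definition used by the authors is given in their App. A):

$$
\begin{aligned}
\dot a &= 2b\,(c - ab), \qquad \dot b = 2a\,(c - ab),\\
\frac{d}{dt}\left(a^2 - b^2\right) &= 2a\dot a - 2b\dot b = 4ab\,(c - ab) - 4ab\,(c - ab) = 0,\\
\frac{d}{dt}\left(ab\right) &= \dot a\, b + a\, \dot b = 2\left(a^2 + b^2\right)(c - ab).
\end{aligned}
$$

The second line shows that $a^2 - b^2$ stays constant along the flow, and the third that the product $ab$ moves monotonically towards $c$, so the trajectory follows a hyperbola $a^2 - b^2 = \text{const}$ until it meets the curve $ab = c$.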

Now consider the more practical setting of SGD for training the VAE’s decoder objective: for each data sample, maximum likelihood leads to the associated stochastic objective (note with our assumptions batch gradient descent would lead to the same form of objective).

$$\underset{a, b > 0}{\text{minimize}}\quad \ell(c_0; \omega; (a, b)) = |C(\omega) - ab|^2, \qquad C \sim N(c_0, \sigma^2). \tag{13}$$

We follow the sequential update rules of Algorithm 1 with learning rate .

The result (green sample path in Fig. 3) is very different from the deterministic case, as the trajectory drifts along the solution set to asymptotically reach a neighborhood of $(\sqrt{c_0}, \sqrt{c_0})$. This drift is likely caused by asymmetries of the optimization landscape in the neighborhood of the optimal set. This phenomenon relates to observations of an implicit regularization behavior of SGD (Zhang et al., 2016; Neyshabur et al., 2017), as it exhibits the same convergence to the minimum Euclidean norm solution. While an in-depth analysis of the asymptotic distribution of the solutions may require sophisticated tools related to Markov chains on general state spaces (e.g. Tweedie, 1974), we provide the following result on the evolution of a distribution close to the solution set.

###### Proposition 8.

Assume an initial distribution and , such that , then after one SGD iteration, the updated values satisfy (using as above)

$$\mathbb{E}\left[L\big(A^{(1)}, B^{(1)}\big)\right] = \eta\, \mathbb{E}\left[L\big(A^{(0)}, B^{(0)}\big)\right], \qquad 0 < \eta < 1.$$

Proof is in Appendix A. This result suggests that an SGD iteration makes points in the neighborhood of evolve (on average) towards , corresponding to the subset , such that after many iterations the distribution concentrates around .

Interestingly, if we try other variants of stochastic optimization on the same deterministic objective, we can get different dynamics for the drift, suggesting that it is influenced by the precise algorithm used (see App. C.1 for the case of Asynchronous SGD, with example drift in blue on Fig. 3).

These drift phenomena seem to introduce a form of non-statistical dependency between the two parameters, $a$ and $b$, of the solution, which we will call parametric entanglement. We can quantify this as a lack of independence between mechanisms, as measured with genericity in Sec. 3.1.
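The drift described in this section can be reproduced with a few lines of NumPy. Algorithm 1 is not reproduced in this text, so the sketch below assumes a plain SGD step in which $a$ and $b$ are updated simultaneously from the same noisy sample; the values of `c0`, `sigma`, `lr` and the starting point are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
c0, sigma, lr = 1.0, 1.0, 0.01   # illustrative values
a, b = 2.0, 0.5                  # starts on the solution set: a * b = c0
gap_start = abs(a**2 - b**2)

for _ in range(20_000):
    c = rng.normal(c0, sigma)    # one noisy sample C ~ N(c0, sigma^2)
    r = c - a * b                # residual of the stochastic objective (13)
    # simultaneous gradient step on |c - a*b|^2 (assumed update rule)
    a, b = a + 2 * lr * b * r, b + 2 * lr * a * r

gap_end = abs(a**2 - b**2)
```

For this assumed update rule, expanding the squares shows that each step multiplies $a^2 - b^2$ by $(1 - 4\,\mathtt{lr}^2 r^2)$, so the sampling noise steadily shrinks the quantity that noise-free gradient flow would conserve, and the iterates approach the balanced solution $a = b$.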

#### 3.3 Entanglement of SGD solutions

We now get back to the multidimensional setting for Model 2 to support our intuition that the drift phenomena observed in Section 3.2 lead to parametric entanglement. Transposing the above SGD setting to the $d$-dimensional case trivially leads to the same behavior for each component, which evolve independently from each other. Interestingly, we can then show that TR is consistent with SGD drift inducing entanglement, assuming the drift leads to the matrix square root solution for both factors, as observed in Sec. 3.2:

###### Proposition 9.

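In Model 2, assume the diagonal coefficients of the true parameters $A^*$ and $B^*$ are i.i.d. sampled from two arbitrary non-constant distributions. Then, the square root solution $\tilde A = \tilde B = \sqrt{A^* B^*}$ satisfies

$$\mathrm{TR}(\tilde B, \tilde A) \xrightarrow[d \to +\infty]{} \mathbb{E}\left[c_1^2\right] / \mathbb{E}\left[c_1\right]^2 > 1,$$

where $c_1 = a_1^* b_1^*$ is the product of the first diagonal coefficients, and is thus not $G$-generic.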

This implies that the TR will detect a positive correlation between the matrices $\tilde A$ and $\tilde B$, which is caused by the particular choice of a solution within the solution set by the SGD algorithm.
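The limit in Proposition 9 can be recovered directly from the definition of TR in eq. (11). Writing $c_i = a_i^* b_i^*$ for the product of the $i$-th true diagonal coefficients (our notation for this sketch), the square root solution has $\tilde a_i = \tilde b_i = \sqrt{c_i}$, so that

$$
\mathrm{TR}(\tilde B, \tilde A) = \frac{\frac{1}{d}\sum_{i=1}^{d} \tilde a_i^2\, \tilde b_i^2}{\left(\frac{1}{d}\sum_{i=1}^{d} \tilde a_i^2\right)\left(\frac{1}{d}\sum_{i=1}^{d} \tilde b_i^2\right)} = \frac{\frac{1}{d}\sum_{i=1}^{d} c_i^2}{\left(\frac{1}{d}\sum_{i=1}^{d} c_i\right)^2} \xrightarrow[d \to +\infty]{} \frac{\mathbb{E}[c_1^2]}{\mathbb{E}[c_1]^2} = 1 + \frac{\mathrm{Var}(c_1)}{\mathbb{E}[c_1]^2}.
$$

The limit follows from the law of large numbers, and it is strictly larger than 1 because $c_1$ is not constant and therefore has positive variance.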

#### 3.4 Extension to convolutional Model 1

We show qualitatively how the above entanglement result observed in Model 2 can provide insights for the case of Model 1. We use the same VAE-like SGD optimization framework for this case, where we consider a fixed latent value $z$, this time a Dirac pixel at a fixed location. We apply the DFT to Model 1 (eq. 4) and use the Parseval formula to convert the least squares optimization problem to the Fourier domain. Simulating SGD on the real and imaginary parts of $\hat{k}_1$ and $\hat{k}_2$, we see in Fig. 3 the same drift behavior towards solutions having identical squared modulus ($|\hat{k}_1|^2 = |\hat{k}_2|^2$), as described for Model 2 in Sec. 3.2. As a consequence, the TR generic ratio of eq. (11) may be applied to the diagonal matrices implementing the convolutions in the Fourier domain to quantify entanglement induced by SGD optimization, as shown in Prop. 9. Interestingly, this generic ratio measuring SGD induced entanglement corresponds to the generic ratio measuring scale separation elaborated in Sec. 2.8, eq. (9):

$$\mathrm{TR}\big(\hat{k}_1^*, \hat{k}_2^*\big) = \rho\big(k_1^*, k_2^*\big). \tag{14}$$

### 4 Application to deep generative models

We now exploit our framework in the context of deep convolutional networks trained on complex real world data.

#### 4.1 Methodology

We consider two successive layers. As shown in Fig. 4, in contrast to Model 1 studied in previous sections, a single layer consists of multiple 2D activation maps, called channels, to which convolutions and non-linearities are applied to yield the activation maps forwarded to the next layer. More precisely, an activation map (corresponding to one channel in the considered layer) is generated from the channels’ activation maps in the previous layer through the multi-channel kernel as

 X (15)

By looking only at a “partial” activation map (the contribution of a single input channel $V_i$ through its kernel $k_2^i$), we get back to the case of Model 1 (up to some additive constant bias). Therefore, unless specified otherwise, the term filter will refer to the partial filters $k_2^i$.

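Next, to get an empirical estimate of the SDR for the partial filter, we consider its cause-effect formulation in eq. (9) and estimate the expectation with an empirical average over a batch of $B$ samples $v_i^k$, $k = 1, \dots, B$:

$$\rho\big(V_i, k_2^i\big) = \frac{\left\langle \mathbb{E}\,|\hat{k}_2^i \odot \hat{V}_i|^2 \right\rangle}{\left\langle \mathbb{E}\,|\hat{V}_i|^2 \right\rangle \left\langle |\hat{k}_2^i|^2 \right\rangle} \approx \frac{\left\langle \frac{1}{B}\sum_{k} |\hat{k}_2^i \odot \hat{v}_i^k|^2 \right\rangle}{\left\langle \frac{1}{B}\sum_{k} |\hat{v}_i^k|^2 \right\rangle \left\langle |\hat{k}_2^i|^2 \right\rangle}. \tag{16}$$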

One additional difference with respect to Model 1 is a stride parameter, which interleaves zero-valued pixels between input pixels along each dimension before applying the convolution operation, in order to progressively increase the dimension and resolution of the image from one layer to the next. As shown in App. B.5, this can be easily modeled and leads to a slightly different SDR (eq. 24).
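The batch estimate of eq. (16) can be sketched as follows for the stride-free case. The function name `empirical_sdr` and the toy data are assumptions for illustration; in particular, the filter is taken to be zero-padded to the size of the activation maps, and the stride correction of App. B.5 is not included.

```python
import numpy as np

def empirical_sdr(kernel, maps):
    """Batch estimate of the SDR between one filter and one input channel.

    kernel: (d, d) filter, zero-padded to the activation map size (assumption).
    maps:   (B, d, d) batch of activation maps of that channel.
    """
    k_pow = np.abs(np.fft.fft2(kernel)) ** 2                   # |k_hat|^2
    v_pow = np.mean(np.abs(np.fft.fft2(maps)) ** 2, axis=0)    # batch mean of |v_hat|^2
    return np.mean(k_pow * v_pow) / (np.mean(v_pow) * np.mean(k_pow))

# Toy check on a hypothetical eye generator: the activation maps are random
# translations of a two-pixel kernel, the filter is a 3x3 eye shape.
d = 17
k1 = np.zeros((d, d))
k1[0, 0] = k1[0, 6] = 1.0
k2 = np.zeros((d, d))
k2[:3, :3] = [[0, 1, 0], [1, 2, 1], [0, 1, 0]]

rng = np.random.default_rng(0)
shifts = rng.integers(0, d, size=(8, 2))
maps = np.stack([np.roll(k1, tuple(s), axis=(0, 1)) for s in shifts])
rho_hat = empirical_sdr(k2, maps)
```

Translations do not change the modulus of the DFT, so on this toy batch the estimate coincides with the kernel form of eq. (9).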

#### 4.2 Enforcing spectral independence

In order to enforce genericity, direct optimization of the Euclidean distance to 1 of the SDR statistic is challenging due to the normalization term in eq. (16). To avoid this, for a fixed activation map, we multiply the squared difference between the SDR and its ideal value of 1 by the normalization term and optimize this quantity. For a single (filter, map) pair, this leads to the minimization of the objective

$$\left\langle |\hat{k}_2^i|^2 \odot \left( \frac{\frac{1}{B}\sum_{k} |\hat{v}_i^k|^2}{\left\langle \frac{1}{B}\sum_{k} |\hat{v}_i^k|^2 \right\rangle} - 1 \right) \right\rangle^2, \tag{17}$$

and for multiple pairs the sum of these objectives is used.
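The objective of eq. (17) for one (filter, map) pair can be evaluated as in the sketch below. This NumPy version only computes the value of the objective; optimizing it during training would require an automatic differentiation framework, which is not shown. The function name and the random test data are hypothetical.

```python
import numpy as np

def spectral_independence_loss(kernel, maps):
    """Value of objective (17) for one filter (d, d) and a batch of maps (B, d, d)."""
    k_pow = np.abs(np.fft.fft2(kernel)) ** 2                   # |k_hat|^2
    v_pow = np.mean(np.abs(np.fft.fft2(maps)) ** 2, axis=0)    # batch mean of |v_hat|^2
    return np.mean(k_pow * (v_pow / np.mean(v_pow) - 1.0)) ** 2

rng = np.random.default_rng(0)
kernel = rng.normal(size=(8, 8))
maps = rng.normal(size=(4, 8, 8))
loss = spectral_independence_loss(kernel, maps)
```

Expanding the average shows that this value equals $\langle|\hat{k}_2^i|^2\rangle^2\,(\rho - 1)^2$, with $\rho$ the batch estimate of eq. (16), so the objective vanishes exactly when the empirical SDR equals 1.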

### 5 Experiments on deep face generators

We now empirically assess genericity in deep convolutional generative networks based on SDR analysis and extrapolations. This is done in the context of learning the distribution of CelebA. We used a plain $\beta$-VAE (Higgins et al., 2017) and the official tensorlayer DCGAN implementation. The general structure of the VAE is summarized in Fig. 4 and the DCGAN architecture is very similar (details in Table 1). We denote the 4 different convolutional layers as indicated in Fig. 4: coarse (closest to latent variables), intermediate, fine and image level.

#### 5.1 SIC between successive deconvolutional units

We first study the distribution of the SDR statistic between all possible (filter, activation map) pairs in a given layer. The result for the VAE is shown in Fig. 5, exhibiting a mode of the SDR close to 1 - the value of ideal spectral independence - for layers of the decoder, which suggests genericity of the convolution kernels between successive layers. Interestingly, the encoder architecture, which implements convolutional layers of the same dimensions in reverse order, exhibits a much broader distribution of the SDR at all levels, especially for layers encoding lower level image features. This is in line with results stating that if a mechanism (here the generator) satisfies the principle of independent causal mechanisms, the inverse mechanism (here the encoder) will not (Shajarisales et al., 2015).

In supplemental Fig. 8, we also show the same analysis for a GAN. While genericity is slightly better in the generator, it is also rather good for the discriminator, in line with the fact that the discriminator does not perform an inverse mapping of the generator. Overall, results support ICM is achieved to some extent in vanilla generator architectures. Next, we thus can investigate extrapolation capabilities formalized in Section 2.5 by perturbing hidden layers.

#### 5.2 Evaluating extrapolation across spatial scales

We intervene at a particular scale by applying a 1.5-fold horizontal stretching transformation to all maps of a given hidden convolutional layer, and compare the resulting perturbed image to the result of directly stretching the output sample.
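One possible implementation of such a stretch is sketched below. The details are assumptions made for illustration and are not specified in the text: the maps keep their size, the stretch is performed about the central column, and values are obtained by linear interpolation.

```python
import numpy as np

def stretch_horizontal(maps, factor=1.5):
    """Stretch maps of shape (..., H, W) horizontally about the central column."""
    w = maps.shape[-1]
    cols = np.arange(w, dtype=float)
    center = (w - 1) / 2.0
    src = (cols - center) / factor + center      # where each output column reads from
    return np.apply_along_axis(lambda row: np.interp(src, cols, row), -1, maps)

# Toy batch: 2 maps of size 5 x 33 with a vertical line 4 columns right of center.
maps = np.zeros((2, 5, 33))
maps[:, :, 20] = 1.0
stretched = stretch_horizontal(maps)             # the line moves to column 22
```

A feature located 4 columns away from the center ends up 6 columns away after the stretch, and content that would fall outside the map is cropped.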

##### 5.2.1 Scale of convolutional layers

The images obtained by distorting convolutional layers’ activation maps are presented in Fig. 6(a) for the VAE trained with 10000 iterations. This affects features encoded at different scales of the picture differently: stretching the intermediate level activation maps (second row of Fig. 6(a)) mostly keeps the original dimensions of each eye, while the inter-eye distance stretches. Interestingly, Fig. 6(b), which replicates the result after 40000 additional training iterations, shows perturbed images of poorer quality. This suggests an increase of parametric entanglement with excessive training, in line with the drift phenomenon observed in Sec. 3.3. In particular, stronger periodic interference patterns, like those in Fig. 2, appear for the stretching of the fine level representation (compare Figs. 6(b) vs. 6(a), 3rd row).

We also used a discrete Haar wavelet transform of the images to isolate the contribution of each scale to the image (Mallat, 1999). For each scale, we computed the difference between the stretched output image and the image obtained through stretching of the hidden layer. Resulting examples are plotted in Fig. 6(c) for 10000 training iterations. In accordance with the above observations, perturbations localized at the level of eyes, mouth and nose for the intermediate level (Fig. 6(c), second row) reflect that the dimensions of these patterns are not fully rescaled, although their position is. We then computed the mean squared error (MSE) resulting from the above differences over all pixels of the images of a batch. The resulting histograms for each perturbed layer on Fig. 6(a) show that the mismatch is stronger at scales corresponding to the depth of the distorted layer. Interestingly, the artifacts observed at 50000 training iterations of the VAE in Fig. 6(d) at the fine level (last rows) suggest a $G$-equivalence violation, as obtained in the example of Fig. 2.

##### 5.2.2 Evolution of extrapolation with training

As supported by the deterioration of visual quality of perturbed images generated after 50000 iterations, excessive SGD-like optimization may increase entanglement. To quantify this effect, we tracked the evolution (as the number of iterations grows) of the mean squared error for the complete picture (Fig. 6(f)), resulting from the stretch of the fine level convolutional layer. This difference grows as the training progresses and the same trend can be observed for the mean squared error of the complete picture. We investigated whether enforcing more genericity between layers during optimization can temper this effect. We trained a VAE by alternately minimizing the spectral dependence of eq. (17) at the image, fine and intermediate levels, interleaved with one SGD iteration on the VAE objective. Fig. 6(f) shows a clear effect of spectral independence minimization on limiting the increase in the distortions as training evolves. This is confirmed by the analysis of pixel differences at 50000 iterations, as seen in Fig. 6(e): perturbations of the intermediate and fine levels exhibit better localization than what was obtained at the same number of iterations (Fig. 6(d)) with classical VAE training, supporting the link between extrapolation and spectral independence of Sec. 2.8.

#### 5.3 Extrapolation of specific features

To justify that extrapolation as introduced in Sec. 2.5 and illustrated in Example 1 is relevant in the context of deep generative models, we now apply stretching to specific visual features. For that we rely on the approach of Besserve et al. (2020) to identify modules of channels in hidden layers that encode specific properties of the output images in a disentangled way. We apply this procedure to the VAE described above, and identify a group of channels distributed across hidden layers encoding eyes. We then apply the horizontal stretch described in previous sections, but only to activations of the channels in the intermediate layer that belong to the module encoding properties of the eyes. The resulting counterfactual samples, shown on Suppl. Fig. 10 (top panel), exhibit faces with disproportionate eyes, in the vein of the deformations that illustrators often apply to fictional characters, as can be observed for example in cartoons or animation movies.

Conclusion. We provide an ICM framework for multi-layered generators that quantifies their extrapolation abilities. Experiments are consistent with our insights and suggest ICM can be enforced to counter over-parameterization effects. This framework can help understand internal representations and generalization in autonomous systems.

### Appendix A Proofs of main text propositions

#### A.1 Proof of Proposition 1

Let be the set of mappings such that . Then each takes the form for some (because the model is NF). This trivially implies that is a subgroup of the bijections . In the same way, the set of mappings such that is also a subgroup and is then also a subgroup.

Next, assume , then .

As a consequence, and for , which implies  . ∎

#### A.2 Proof of Corollary 1

Following the steps of the above proof, it is easy to see that . As a consequence it is also the intersection .

#### A.3 Proof of Proposition 3

This essentially exploits the principles of the simpler proof of Prop. 10. We do the proof in the 1D case, which generalizes to 2D images without fundamental differences. We use the matrix representation of elements of the circular permutation group . Take the “one step to the right” circular permutation matrix

$$P=\begin{bmatrix}0&0&0&1\\1&\ddots&\ddots&0\\0&\ddots&\ddots&0\\0&0&1&0\end{bmatrix}$$

Assume belongs to the solution set for true parameters , this implies that both choices of parameters have the same statistics. In particular, in the Fourier domain we get,

$$\mathbb{E}|\hat{X}|^{2}=|\hat{k}_1^{*}|^{\odot 2}\,|\hat{k}_2^{*}|^{\odot 2}=|\hat{k}_1|^{\odot 2}\,|\hat{k}_2|^{\odot 2}$$

where denotes the entrywise th-power of array . This implies

$$|\hat{k}_2^{*}|^{\odot -2}\,|\hat{k}_2|^{\odot 2}=|\hat{k}_1^{*}|^{\odot 2}\,|\hat{k}_1|^{\odot -2}\triangleq R \tag{18}$$

where Following the steps of Proof of Proposition 10 below yields the result. ∎

#### A.4 Proof of Proposition 7

The gradient of the objective in equation 12 is

$$\nabla_a L = -2b(c-ab) \tag{19}$$

$$\nabla_b L = -2a(c-ab) \tag{20}$$

Hence the dynamics of continuous time gradient descent is (assuming a unit learning rate without loss of generality)

$$\frac{da}{dt} = -\nabla_a L = 2b(c-ab) \tag{21}$$

$$\frac{db}{dt} = -\nabla_b L = 2a(c-ab) \tag{22}$$

Thus the trajectories of this dynamical system satisfy the equation

implying that is a constant along the trajectories. Assuming , each trajectory satisfies for some constant .

If we restrict ourselves to the domain and , stationary points are the elements of the hyperbola . ∎
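As a numerical illustration, equations (21) and (22) give 2a da/dt − 2b db/dt = 4ab(c−ab) − 4ab(c−ab) = 0, so a² − b² should stay constant along a trajectory. The sketch below checks this with a small-step Euler discretization. It assumes the objective is (c − ab)², which is consistent with the gradients (19) and (20); the function name, step size and starting point are hypothetical choices.

```python
import numpy as np

def gradient_flow(a, b, c, step=1e-4, n_steps=20000):
    """Euler discretization of da/dt = 2b(c - ab), db/dt = 2a(c - ab)."""
    for _ in range(n_steps):
        err = c - a * b
        a, b = a + step * 2 * b * err, b + step * 2 * a * err
    return a, b

a0, b0, c = 2.0, 0.5, 3.0          # hypothetical initial point and target
a_end, b_end = gradient_flow(a0, b0, c)
invariant_start = a0**2 - b0**2    # value of a^2 - b^2 at initialization
invariant_end = a_end**2 - b_end**2
```

With these values the trajectory ends close to the hyperbola ab = c, while a² − b² stays close to its initial value up to discretization error.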

#### A.5 Sketch of the proof of Proposition 8

The evolution of during SGD follows the difference equation

$$L(a_{n+1},b_{n+1})=a_{n+1}^{2}-b_{n+1}^{2}=(a_n^{2}-b_n^{2})\left(1-4\lambda^{2}(c_n-a_nb_n)^{2}\right)$$

Expanding the left-hand side and simplifying the expression by exploiting the independence and Gaussianity of , we get the result. ∎
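The algebraic identity in the difference equation can be checked numerically for a single step. This sketch assumes the SGD update is the discrete counterpart of (21) and (22) with learning rate λ, that is, a ← a + 2λb(c − ab) and b ← b + 2λa(c − ab); all numerical values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 0.05            # learning rate (hypothetical value)
a, b = 1.5, -0.7      # current parameters a_n, b_n (hypothetical values)
c = rng.normal()      # one Gaussian sample c_n

err = c - a * b
a_next = a + 2 * lam * b * err   # assumed SGD update for a
b_next = b + 2 * lam * a * err   # assumed SGD update for b

lhs = a_next**2 - b_next**2
rhs = (a**2 - b**2) * (1 - 4 * lam**2 * err**2)
```

The cross terms 4λab(c − ab) cancel in the difference of squares, which is why only the second-order factor remains.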

#### A.6 Proof of Proposition 5

Following the assumptions, is a discrete 2D image consisting of unit discrete Diracs located at different pixels. Without loss of generality, we assume one of these pixels is located at coordinate and the other at coordinate . Then the squared Discrete Fourier Transform (DFT) of reads

$$|\mathcal{F}g(u,v)|^{2}=\frac{1}{d^{2}}\left(2+2\cos\left(2\pi(m_0u+n_0v)\right)\right)$$

and its sum over frequencies is

$$\left\langle|\mathcal{F}g(u,v)|^{2}\right\rangle=2$$

Without loss of generality we assume that has unit energy, such that the SDR writes

Using the Fourier convolution-product calculation rules, this term also corresponds to the value of circular convolution at index , where denotes mirroring of both spatial axes. Since the support of is bounded by a square of side , the support of is bounded by a square of side . Convolution of this quantity by (which is a sum of one central unit Dirac at and two side Diracs at ), yields a superposition of one central pattern and two translated versions of this term around , and this pattern is additionally periodized with period d along both dimensions (due to circularity of the convolution). Since by assumption , the supports of the translated terms, as well as their periodized copies, do not reach index , and the value of is given by the central term

due to the unit energy assumption on . ∎
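The closed form of the squared DFT used in this proof can be checked numerically. The sketch assumes two unit Diracs placed at (0, 0) and (m0, n0), normalized frequencies u, v taken as multiples of 1/d, and hypothetical values for d, m0 and n0.

```python
import numpy as np

d, m0, n0 = 16, 3, 5   # hypothetical image size and offset of the second Dirac
g = np.zeros((d, d))
g[0, 0] = 1.0
g[m0, n0] = 1.0

# Squared DFT with the 1/d^2 normalization used in the text.
power = np.abs(np.fft.fft2(g)) ** 2 / d**2

# Normalized frequency grids: u varies along axis 0, v along axis 1.
u, v = np.meshgrid(np.arange(d) / d, np.arange(d) / d, indexing="ij")
closed_form = (2 + 2 * np.cos(2 * np.pi * (m0 * u + n0 * v))) / d**2
total = power.sum()
```

The cosine term sums to zero over the frequency grid, so only the constant term contributes to the total.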

#### B.1 Elements of group theory

We concisely introduce the concepts and results of group theory needed in this paper. The reader can refer, for example, to Tung (1985), Wijsman (1990) and Eaton (1989) for more details.

###### Definition 3 (Group).

A set is said to form a group if there is an operation ‘*’, called group multiplication, such that:

1. For any , .

2. The operation is associative: , for all ,

3. There is one identity element such that, for all ,

4. Each has an inverse such that, .

A subset of is called a subgroup if it is a group under the same multiplication operation.

The following elementary properties are a direct consequence of the above definition: , , , for all .

###### Definition 4 (Topological group).

A locally compact Hausdorff topological group is a group equipped with a locally compact Hausdorff topology such that:

• is continuous,

• is continuous (using the product topology).

The σ-algebra generated by all open sets of G is called the Borel algebra of G.

###### Definition 5 (Invariant measure).

Let be a topological group according to Definition 4. Let be the set of continuous real-valued functions with compact support on . A Radon measure defined on Borel subsets is left invariant if for all and

$$\int_G f(g^{-1}x)\,d\mu(x)=\int_G f(x)\,d\mu(x)$$

Such a measure is called a Haar measure.

A key result regarding topological groups is the existence and uniqueness up to a positive constant of the Haar measure (Eaton, 1989). Whenever is compact, the Haar measures are finite and we will denote the unique Haar measure such that , defining an invariant probability measure on the group.

#### B.2 Circular convolution

We provide here the definition for a one-dimensional signal; the generalization to two dimensions is straightforward.

Circular convolution of finite sequences and their Fourier analysis are best described by considering the signal periodic. In our developments, whenever appropriate, the signal can be considered as -periodic by defining for any , , whenever and . One way to describe these sequences is then to see them as functions of the quotient ring .

Given two -periodic sequences , their circular convolution is defined as

#### B.3 Fourier analysis of discrete signals and images

The Discrete Fourier Transform (DFT) of a periodic sequence is defined as

$$\hat{a}[n]=\sum_{k\in\mathbb{Z}_d}a[k]\,e^{-i2\pi nk/d},\quad n\in\mathbb{Z}_d.$$

Note that the DFT of such a sequence can as well be seen as a $d$-periodic sequence. Importantly, the DFT is invertible using the formula

$$a[k]=\frac{1}{d}\sum_{n\in\mathbb{Z}_d}\hat{a}[n]\,e^{i2\pi nk/d},\quad k\in\mathbb{Z}_d.$$

By Parseval’s theorem, the energy (sum of squared coefficients) of the sequence can be expressed in the Fourier domain by . The Fourier transform can be easily generalized to 2D signals of the form , leading to a 2D function, 1-periodic with respect to both arguments

$$\hat{b}(u,v)=\sum_{k\in\mathbb{Z},\,l\in\mathbb{Z}}b[k,l]\,e^{-i2\pi(uk+vl)},\quad(u,v)\in\mathbb{R}^{2}.$$

In both the 1D and 2D cases, one interesting property of the DFT is that it transforms convolutions into entrywise products: in the 1D case, the DFT of the circular convolution of two sequences is the entrywise product of their DFTs.
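A short numerical sketch of this property, assuming the standard definition of circular convolution in which indices are taken modulo the sequence length. The function name `circular_convolution` and the test sequences are illustrative.

```python
import numpy as np

def circular_convolution(a, b):
    """Circular convolution of two length-d sequences, indices taken modulo d."""
    d = len(a)
    return np.array([sum(a[k] * b[(n - k) % d] for k in range(d)) for n in range(d)])

rng = np.random.default_rng(0)
a, b = rng.normal(size=8), rng.normal(size=8)

direct = circular_convolution(a, b)
# Same result through the Fourier domain: entrywise product of the DFTs.
via_fft = np.real(np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)))
```

Note that `numpy.fft.fft` uses the unnormalized forward transform and puts the 1/d factor in the inverse, which matches the pair of formulas with a 1/d factor in the inverse transform.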

###### Proposition 10.

Let be the group of cyclic component permutations. A solution for Model 2 is -equivalent to the true model if and only if there is a such that .

###### Proof.

We use the matrix representation of elements of the circular permutation group . Take the “one step to the right” circular permutation matrix

$$P=\begin{bmatrix}0&0&0&1\\1&\ddots&\ddots&0\\0&\ddots&\ddots&0\\0&0&1&0\end{bmatrix}$$

Assume belongs to the solution set for true parameters , we get that

$$(A^{*})^{-1}A=B^{*}B^{-1}\triangleq R \tag{23}$$

By definition of -equivalence, it is required that and is thus in bijection with it, which implies that there is a permutation of the indices such that , for all . By averaging over circular permutations distributed according to the Haar measure, and turn into multiples of the identity and we get with and by using eq. 23 we get the requirement . The converse implication is trivial. ∎
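The averaging step can be illustrated numerically. For the finite group of circular permutations the Haar probability measure is the uniform distribution over the d powers of P, and conjugating a diagonal matrix by these powers cycles its diagonal entries. The matrix `D` below is a hypothetical example, not a quantity from the proof.

```python
import numpy as np

d = 5
# "One step to the right" circular permutation: first row (0, ..., 0, 1), ones below the diagonal.
P = np.roll(np.eye(d), 1, axis=0)
D = np.diag([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical diagonal matrix

powers = [np.linalg.matrix_power(P, k) for k in range(d)]
# Uniform (Haar) average of the conjugates P^k D P^{-k}; P^{-k} is the transpose of P^k.
average = sum(Pk @ D @ Pk.T for Pk in powers) / d
```

The average is the mean of the diagonal entries times the identity, which is the sense in which a diagonal matrix turns into a multiple of the identity.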

###### Corollary 2.

Model 2 is an NF model for which is the set of square positive definite diagonal matrices, and .

###### Proof.

Let us first characterize . Assume , then for linear transformations associated to diagonal positive definite matrices. Thus is also the canonical linear transformation of a diagonal positive definite matrix. Conversely, assume is diagonal positive definite, then obviously and are also positive definite. Hence is (canonically associated to) the group of square diagonal positive definite matrices.

Second, let us show (converse inclusion holds by definition). Let , then the -th diagonal coefficient satisfies

$$AB=A^{*}B^{*}\triangleq C^{*}$$

which implies and , which leads to and for .

Thus . ∎

#### B.5 SDR expression in the strided case

Striding can be easily modeled, as it amounts to up-sampling the input image before convolution. We denote by $x^{\uparrow s}$ the result of the up-sampling operation with integer factor $s$ (the inverse of the stride parameter, which is fractional in that case), which turns the 2D activation map $x$ into

$$x^{\uparrow s}[k,l]=\begin{cases}x[k/s,\,l/s], & k \text{ and } l \text{ multiples of } s,\\ 0, & \text{otherwise,}\end{cases}$$

leading to a compression of the normalized frequency axis in the Fourier domain such that . The convolution relation in Fourier domain thus translates to . As a consequence, the SDR measure needs to be adapted to up-sampling by rescaling the frequency axis of the activation map with respect to the one of the filter. Using power spectral density estimates based on Bartlett’s method, we use a batch of input images of size leading to values of activation map , , to obtain the following SDR estimate:

 (24)
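The compression of the frequency axis caused by up-sampling can be illustrated with a small sketch: inserting zeros between samples makes the DFT of the up-sampled map a tiled copy of the original DFT. The function name `upsample` and the array sizes are hypothetical, and the sketch does not implement the SDR estimate itself.

```python
import numpy as np

def upsample(x, s):
    """Insert zeros so that up[k, l] = x[k/s, l/s] when s divides k and l, else 0."""
    up = np.zeros((x.shape[0] * s, x.shape[1] * s), dtype=x.dtype)
    up[::s, ::s] = x
    return up

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 4))   # hypothetical activation map
s = 2                         # up-sampling factor

spectrum_up = np.fft.fft2(upsample(x, s))
# The original spectrum repeated s times along each frequency axis.
spectrum_tiled = np.tile(np.fft.fft2(x), (s, s))
```

Each period of the up-sampled spectrum thus contains s copies of the original one per axis, which is why the frequency axis of the activation map has to be rescaled relative to that of the filter.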

#### B.6 Network hyper-parameters

| Architecture | DCGAN | VAE |
|---|---|---|
| Nb. of deconv. layers/channels of generator | 4/(128,64,32,16,1) | 4/(128,64,32,16,3) |
| | (4,8,16,32) | (8,16,32,64) |
| | Adam (β=0.5) | Adam (β=0.5) |
| | GAN loss | VAE loss (Gaussian posteriors) |
| | 64 | 64 |
| | N/A | 0.0005 |

#### B.7 FluoHair experiment

To obtain the result of Fig. 1, we proceeded as follows. We ran the clustering of hidden layer channels into modules encoding different properties, using the approach proposed by Besserve et al. (2020) based on the non-negative matrix factorization technique, and chose a hyperparameter of 3 clusters. For the last hidden layer of the generator, we identified the channels belonging to the cluster encoding hair properties. We then identified and modified the tensor encoding the convolution operation of the last layer (mapping the last hidden layer to RGB image color channels, as described in Fig. 1), by changing the sign of the kernel coefficients corresponding to inputs originating from the identified hidden channels encoding hair. This generates pink hair. In order to change the color to green (or blue), we additionally permute, for the same coefficients, the targeted color channels (between red, green and blue).
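The kernel edit described above can be sketched on a plain array. The kernel layout (RGB output, hidden input, height, width), the number of hidden channels, the indices in `hair_channels` and the function name `recolor` are all hypothetical; in practice the indices would come from the clustering step and the kernel from the trained generator.

```python
import numpy as np

rng = np.random.default_rng(0)
n_hidden = 16
# Hypothetical last-layer kernel with shape (RGB output, hidden input, height, width).
kernel = rng.normal(size=(3, n_hidden, 4, 4))
hair_channels = [2, 5, 11]   # hypothetical indices of the channels in the hair cluster

def recolor(kernel, channels, rgb_order=(0, 1, 2)):
    """Flip the sign of the coefficients fed by `channels`, then permute their RGB targets."""
    edited = kernel.copy()
    flipped = -kernel[:, channels]
    # Output channel i now receives the flipped coefficients of channel rgb_order[i].
    edited[:, channels] = flipped[list(rgb_order)]
    return edited

sign_flipped = recolor(kernel, hair_channels)
permuted = recolor(kernel, hair_channels, rgb_order=(1, 2, 0))
```

Only the coefficients reading from the selected hidden channels are modified, so the rest of the image is generated exactly as before. Which permutation yields which color is not specified here and would have to be checked on the generated images.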