SVCCA: Singular Vector Canonical Correlation Analysis for Deep Learning Dynamics and Interpretability

We propose a new technique, Singular Vector Canonical Correlation Analysis (SVCCA), a tool for quickly comparing two representations in a way that is both invariant to affine transform (allowing comparison between different layers and networks) and fast to compute (allowing more comparisons to be calculated than with previous methods). We deploy this tool to measure the intrinsic dimensionality of layers, showing in some cases needless over-parameterization; to probe learning dynamics throughout training, finding that networks converge to final representations from the bottom up; to show where class-specific information in networks is formed; and to suggest new training regimes that simultaneously save computation and overfit less. Code: https://github.com/google/svcca/

1 Introduction

As the empirical success of deep neural networks (hinton2012deep ; krizhevsky2012imagenet ; wu2016google) becomes an indisputable fact, the goal of better understanding these models escalates in importance. Central to this aim is a core issue of deciphering learned representations. Facets of this key question have been explored empirically, particularly for image models, in alain2016understanding ; eigen2013understanding ; lenc2015understanding ; li2015convergent ; mahendran2015understanding ; montavon2011kernel ; simonyan2013deep ; yosinski2015understanding ; zeiler2014visualizing . Most of these approaches are motivated by interpretability of learned representations. More recently, li-2016-arXivICLR-convergent-learning:-do-different studied the similarities of representations learned by multiple networks by finding permutations of neurons with maximal correlation.

In this work we introduce a new approach to the study of network representations, based on an analysis of each neuron's activation vector: the scalar outputs it emits on input datapoints. With this interpretation of neurons as vectors (and layers as subspaces, spanned by neurons), we introduce SVCCA, Singular Vector Canonical Correlation Analysis, an amalgamation of Singular Value Decomposition and Canonical Correlation Analysis (CCA) ccapaper , as a powerful method for analyzing deep representations. Although CCA has not previously been used to compare deep representations, it has been used for related tasks such as computing the similarity between modeled and measured brain activity sussillo2015neural , and training multi-lingual word embeddings in language models faruqui2014improving .

The main contributions resulting from the introduction of SVCCA are the following:

1. We ask: is the dimensionality of a layer’s learned representation the same as the number of neurons in the layer? Answer: No. We show that trained networks perform equally well with a number of directions just a fraction of the number of neurons with no additional training, provided they are carefully chosen with SVCCA (Section 2.1). We explore the consequences for model compression (Section 4.4).

2. We ask: what do deep representation learning dynamics look like? Answer: Networks broadly converge bottom up. Using SVCCA, we compare layers across time and find they solidify from the bottom up. This suggests a simple, computationally more efficient method of training networks, Freeze Training, where lower layers are sequentially frozen after a certain number of timesteps (Sections 4.1, 4.2).

3. We develop a method based on the discrete Fourier transform which greatly speeds up the application of SVCCA to convolutional neural networks (Section 3).

4. We also explore an interpretability question, of when an architecture becomes sensitive to different classes. We find that SVCCA captures the semantics of different classes, with similar classes having similar sensitivities, and vice versa. (Section 4.3).

Experimental Details

Most of our experiments are performed on CIFAR-10 (augmented with random translations). The main architectures we use are a convolutional network and a residual network (convnet layers: conv-conv-bn-pool-conv-conv-conv-bn-pool-fc-bn-fc-bn-out; resnet layers: conv-(x10 c/bn/r block)-(x10 c/bn/r block)-(x10 c/bn/r block)-bn-fc-out). To produce a few figures, we also use a toy regression task: training a four hidden layer fully connected network with 1D input and 4D output, to regress on four different simple functions.

2 Measuring Representations in Neural Networks

Our goal in this paper is to analyze and interpret the representations learned by neural networks. The critical question from which our investigation departs is: how should we define the representation of a neuron? Consider that a neuron at a particular layer in a network computes a real-valued function over the network’s input domain. In other words, if we had a lookup table of all possible mappings for a neuron, it would be a complete portrayal of that neuron’s functional form.

However, such infinite tables are not only practically infeasible, but are also problematic to process into a set of conclusions. Our primary interest is not in the neuron’s response to random data, but rather in how it represents features of a specific dataset (e.g. natural images). Therefore, in this study we take a neuron’s representation to be its set of responses over a finite set of inputs — those drawn from some training or validation set.

More concretely, for a given dataset $X = \{x_1, \dots, x_m\}$ and a neuron $i$ on layer $l$, $z^l_i$, we define $z^l_i$ to be the vector of outputs on $X$, i.e.

$$z^l_i = (z^l_i(x_1), \dots, z^l_i(x_m))$$

Note that this is a different vector from the often-considered vector of the “representation at a layer of a single input.” Here $z^l_i$ is a single neuron’s response over the entire dataset, not an entire layer’s response for a single input. In this view, a neuron’s representation can be thought of as a single vector in a high-dimensional space. Broadening our view from a single neuron to the collection of neurons in a layer, the layer can be thought of as the set of neuron vectors contained within that layer. This set of vectors will span some subspace. To summarize:

Considered over a dataset $X$ with $m$ examples, a neuron is a vector in $\mathbb{R}^m$.

A layer is the subspace of $\mathbb{R}^m$ spanned by its neurons’ vectors.
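The neuron-as-vector view can be made concrete with a small sketch. It assumes a hypothetical `layer_fn` that returns one layer's outputs for a single input; the toy weights and inputs are illustrative only and are not taken from the experiments in this paper.

```python
import numpy as np

def neuron_vectors(layer_fn, dataset):
    """Return a (num_neurons, m) array: row i is neuron i's vector over the dataset."""
    # layer_fn(x) is assumed to return the layer's outputs for one input,
    # as a 1D array of length num_neurons.
    per_input = np.stack([layer_fn(x) for x in dataset])  # shape (m, num_neurons)
    return per_input.T

# Hypothetical toy layer: three ReLU neurons acting on a scalar input.
weights = np.array([1.0, -2.0, 0.5])

def toy_layer(x):
    return np.maximum(weights * x, 0.0)

dataset = np.linspace(-1.0, 1.0, 5)      # m = 5 datapoints
Z = neuron_vectors(toy_layer, dataset)   # each row is a vector in R^5
```

The rows of `Z` are the neuron vectors, and the layer is the subspace of R^5 that they span. The more familiar "representation of one input" is a column of `Z`.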

Within this formalism, we introduce Singular Vector Canonical Correlation Analysis (SVCCA) as a method for analysing representations. SVCCA proceeds as follows:

• Input: SVCCA takes as input two (not necessarily different) sets of neurons (typically layers of a network) $l_1 = \{z^{l_1}_1, \dots, z^{l_1}_{m_1}\}$ and $l_2 = \{z^{l_2}_1, \dots, z^{l_2}_{m_2}\}$.

• Step 1: First, SVCCA performs a singular value decomposition of each subspace to get sub-subspaces $l'_1 \subset l_1$, $l'_2 \subset l_2$, which comprise the most important directions of the original subspaces $l_1, l_2$. In general we take enough directions to explain 99% of variance in the subspace. This is especially important in neural network representations, where, as we will show, many low variance directions (neurons) are primarily noise.

• Step 2: Second, compute the Canonical Correlation similarity (ccapaper) of $l'_1, l'_2$: linearly transform $l'_1, l'_2$ to be as aligned as possible and compute correlation coefficients. In particular, given the output of step 1, $l'_1 = \{z'^{l_1}_1, \dots, z'^{l_1}_{m'_1}\}$, $l'_2 = \{z'^{l_2}_1, \dots, z'^{l_2}_{m'_2}\}$, CCA linearly transforms these subspaces, $\tilde{l}_1 = W_X l'_1$, $\tilde{l}_2 = W_Y l'_2$, so as to maximize the correlations $\rho_1, \dots, \rho_{\min(m'_1, m'_2)}$ between the transformed subspaces.

• Output: With these steps, SVCCA outputs pairs of aligned directions, $(\tilde{z}^{l_1}_i, \tilde{z}^{l_2}_i)$, and how well they correlate, $\rho_i$. Step 1 also produces intermediate output in the form of the top singular values and directions.

For a more detailed description of each step, see the Appendix. SVCCA can be used to analyse any two sets of neurons. In our experiments, we utilize this flexibility to compare representations across different random initializations, architectures, timesteps during training, and specific classes and layers.

Figure 1 shows a simple, intuitive demonstration of SVCCA. We train a small network on a toy regression task and show each step of SVCCA, along with the resulting very similar representations. SVCCA is able to find hidden similarities in the representations.

2.1 Distributed Representations

An important property of SVCCA is that it is truly a subspace method: both SVD and CCA work with $\mathrm{span}(z_1, \dots, z_m)$ instead of being axis aligned to the $z_i$ directions. SVD finds singular vectors $z'_i$, each a linear combination of the $z_j$, and the subsequent CCA finds a linear transform $W$, giving orthogonal canonically correlated directions $\tilde{z}_i$, each a linear combination of the $z'_j$. In other words, SVCCA has no preference for representations that are neuron (axes) aligned.

If representations are distributed across many dimensions, then this is a desirable property of a representation analysis method. Previous studies have reported that representations may be more complex than either fully distributed or axis-aligned szegedy2013intriguing ; zhou-2014-arXiv-object-detectors-emerge ; li-2016-arXivICLR-convergent-learning:-do-different but this question remains open.

We use SVCCA as a tool to probe the nature of representations via two experiments:

(a) We find that the subspace directions found by SVCCA are disproportionately important to the representation learned by a layer, relative to neuron-aligned directions.

(b) We show that at least some of these directions are distributed across many neurons.

Experiments for (a), (b) are shown in Figure 2 as (a), (b) respectively. For both experiments, we first acquire two different representations, $l_1, l_2$, for a layer $l$ by training two different random initializations of a convolutional network on CIFAR-10. We then apply SVCCA to $l_1$ and $l_2$ to get directions $\tilde{z}^{l_1}_1, \tilde{z}^{l_1}_2, \dots$ and $\tilde{z}^{l_2}_1, \tilde{z}^{l_2}_2, \dots$, ordered according to importance by SVCCA, with each $\tilde{z}^{l_i}_j$ being a linear combination of the original neurons, i.e. $\tilde{z}^{l_i}_j = \sum_r \alpha^{(l_i)}_{jr} z^{l_i}_r$.

For different values of $k$, we can then restrict layer $l_i$’s output to lie in the subspace $\mathrm{span}(\tilde{z}^{l_i}_1, \dots, \tilde{z}^{l_i}_k)$, the most useful $k$-dimensional subspace as found by SVCCA, done by projecting each neuron into this $k$-dimensional space.

We find, somewhat surprisingly, that very few SVCCA directions are required for the network to perform the task well. As shown in Figure 2(a), for a network trained on CIFAR-10, the first 25 dimensions provide nearly the same accuracy as using all 512 dimensions of a fully connected layer with 512 neurons. The accuracy curve rises rapidly with the first few SVCCA directions, and plateaus quickly afterwards. This suggests that the useful information contained in the neurons is well summarized by the subspace formed by the top SVCCA directions. Two baselines for comparison are picking random and maximum activation neuron-aligned subspaces and projecting outputs onto these. Both of these baselines require far more directions (in this case: neurons) before matching the accuracy achieved by the SVCCA directions. These results also suggest approaches to model compression, which are explored in more detail in Section 4.4.

Figure 2(b) next demonstrates that these useful SVCCA directions are at least somewhat distributed over neurons rather than axis-aligned. First, the top SVCCA directions are picked and the representation is projected onto this subspace. Next, the representation is further projected onto a chosen number of neurons, where the neurons are chosen as those most important to the selected SVCCA directions. The resulting accuracy is plotted for different numbers of SVCCA directions (given by the x-axis) and different numbers of neurons (different lines). That, for example, keeping even 100 fc1 neurons (dashed green line) cannot maintain the accuracy of the first 20 SVCCA directions (solid green line at x-axis 20) suggests that those 20 SVCCA directions are distributed across 5 or more neurons each, on average. Figure 3 shows a further demonstration of the effect on the output of projecting onto top SVCCA directions, here for the toy regression case.

Why the two step SV + CCA method is needed.

Both SVD and CCA have important properties for analysing network representations, and SVCCA consequently benefits greatly from being a two step method. CCA is invariant to affine transformations, enabling comparisons without natural alignment (e.g. different architectures, Section 4.4). See Appendix B for proofs and a demonstrative figure. While CCA is a powerful method, it also suffers from certain shortcomings, particularly in determining how many directions were important to the original space, which is the strength of SVD. See Appendix B for an example where naive CCA performs badly. Both the SVD and CCA steps are critical to the analysis of learning dynamics in Section 4.1.

3 Scaling SVCCA for Convolutional Layers

Applying SVCCA to convolutional layers can be done in two natural ways:

1. Same layer comparisons: If $X, Y$ are the same layer (at different timesteps or across random initializations) receiving the same input, we can concatenate along the pixel (height $h$, width $w$) coordinates to form a vector: a conv layer of shape $h \times w \times c$ maps to $c$ vectors, each of dimension $hwm$, where $m$ is the number of datapoints. This is a natural choice because neurons at different pixel coordinates see different image data patches to each other. When $X, Y$ are two versions of the same layer, these different views correspond perfectly.

2. Different layer comparisons: When $X, Y$ are not the same layer, the image patches seen by different neurons have no natural correspondence. But we can flatten an $h \times w \times c$ conv layer into $hwc$ neurons, each of dimension $m$. This approach is valid for convs in different networks or at different depths.
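The two ways of turning convolutional activations into neuron vectors can be sketched as array reshapes. The `(m, h, w, c)` memory layout (datapoints first, channels last) and the function names are assumptions made for illustration.

```python
import numpy as np

def conv_same_layer_view(acts):
    """(m, h, w, c) -> (c, m*h*w): one vector per channel, pixels act as extra datapoints."""
    m, h, w, c = acts.shape
    return acts.reshape(m * h * w, c).T

def conv_different_layer_view(acts):
    """(m, h, w, c) -> (h*w*c, m): one vector per spatial neuron."""
    m = acts.shape[0]
    return acts.reshape(m, -1).T

# Toy activations: 2 datapoints, 3x4 spatial grid, 5 channels.
acts = np.arange(2 * 3 * 4 * 5, dtype=float).reshape(2, 3, 4, 5)
same = conv_same_layer_view(acts)        # shape (5, 24)
diff = conv_different_layer_view(acts)   # shape (60, 2)
```

The first view keeps the number of neurons small (one per channel) at the cost of longer vectors, while the second keeps one vector per spatial position and channel.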

3.1 Scaling SVCCA with Discrete Fourier Transforms

Applying SVCCA to convolutions introduces a computational challenge: the number of neurons ($h \times w \times c$) in convolutional layers, especially early ones, is very large, making SVCCA prohibitively expensive due to the large matrices involved. Luckily the problem of approximate dimensionality reduction of large matrices is well studied, and efficient algorithms exist, e.g. tropp2009randommatrices .

For convolutional layers however, we can avoid dimensionality reduction and perform exact SVCCA, even for large networks. This is achieved by preprocessing each channel with a Discrete Fourier Transform (which preserves CCA due to invariances, see Appendix), causing all (covariance) matrices to be block-diagonal. This allows all matrix operations to be performed block by block, and only over the diagonal blocks, vastly reducing computation. We show:

Theorem 1.

Suppose we have a translation invariant (image) dataset $X$ and convolutional layers $l_1$, $l_2$. Letting $DFT(l_i)$ denote the discrete Fourier transform applied to each channel of $l_i$, the covariance $cov(DFT(l_1), DFT(l_2))$ is block diagonal, with blocks of size $c \times c$, where $c$ is the number of channels.

We make only two assumptions: 1) all layers below $l_1$, $l_2$ are either conv or pooling layers with circular boundary conditions (translation equivariance); 2) the dataset $X$ has all translations of the images $X_i$. This is necessary in the proof for certain symmetries in neuron activations, but these symmetries typically exist in natural images even without translation invariance, as shown in Figure App.2 in the Appendix. Below are key statements, with proofs in the Appendix.

Definition 1.

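Say a single channel image dataset $X$ of images is translation invariant if for any (wlog $n \times n$) image $X_i \in X$, with pixel values $\{z_{11}, \dots, z_{nn}\}$, the translated image $X_i^{(a,b)} = \{z_{\sigma_a(1)\sigma_b(1)}, \dots, z_{\sigma_a(n)\sigma_b(n)}\}$ is also in $X$, for all $0 \le a, b \le n-1$, where $\sigma_a(i) = a + i \bmod n$ (and similarly for $b$).

For a multiple channel image $X_i$, an $(a, b)$ translation is an $(a, b)$ height/width shift on every channel separately. $X$ is then translation invariant as above.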

To prove Theorem 1, we first show another theorem:

Theorem 2.

Given a translation invariant dataset $X$, and a convolutional layer $l$ with channels $\{c_1, \dots, c_k\}$ applied to $X$:

(a) the DFT of $c_i$, $F c_i F^T$, has diagonal covariance matrix (with itself).

(b) the DFTs of $c_i, c_j$, $F c_i F^T$ and $F c_j F^T$, have diagonal covariance with each other.

Finally, both of these theorems rely on properties of circulant matrices and their DFTs:

Lemma 1.

The covariance matrix of $c_i$ applied to translation invariant $X$ is circulant and block circulant.

Lemma 2.

The DFT of a circulant matrix is diagonal.
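Lemma 2 can be checked numerically. The sketch below builds a small random circulant matrix and conjugates it with the unitary DFT matrix. The size and random seed are arbitrary choices for illustration.

```python
import numpy as np

n = 6
rng = np.random.default_rng(0)
first_col = rng.normal(size=n)

# Circulant matrix: column j is the first column cyclically shifted down by j.
C = np.stack([np.roll(first_col, j) for j in range(n)], axis=1)

# Unitary DFT matrix, F[k, j] = exp(-2*pi*i*k*j/n) / sqrt(n).
F = np.fft.fft(np.eye(n)) / np.sqrt(n)

D = F @ C @ F.conj().T                 # the circulant matrix in the Fourier basis
off_diag = D - np.diag(np.diag(D))     # vanishes up to rounding error
```

The diagonal of `D` holds the eigenvalues of `C`, which coincide with the DFT of its first column.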

4 Applications of SVCCA

4.1 Learning Dynamics with SVCCA

We can use SVCCA as a window into learning dynamics by comparing the representation at a layer at different points during training to its final representation. Furthermore, as the SVCCA computations are relatively cheap to compute compared to methods that require training an auxiliary network for each comparison alain2016understanding ; lenc2015understanding ; li-2016-arXivICLR-convergent-learning:-do-different , we can compare all layers during training at all timesteps to all layers at the final time step, producing a rich view into the learning process.

The outputs of SVCCA are the aligned directions $(\tilde{z}^{l_1}_i, \tilde{z}^{l_2}_i)$, how well they align, $\rho_i$, as well as intermediate output from the first step: the singular values and singular directions of each layer. We condense these outputs into a single value, the SVCCA similarity $\bar{\rho}$, that encapsulates how well the representations of two layers are aligned with each other,

$$\bar{\rho} = \frac{1}{\min(m_1, m_2)} \sum_i \rho_i, \qquad (1)$$

where $\min(m_1, m_2)$ is the size of the smaller of the two layers being compared. The SVCCA similarity $\bar{\rho}$ is the average correlation across aligned directions, and is a direct multidimensional analogue of Pearson correlation.

The SVCCA similarity for all pairs of layers, and all time steps, is shown in Figure 4 for a convnet and a resnet architecture trained on CIFAR10.

4.2 Freeze Training

Observing in Figure 4 that networks broadly converge from the bottom up, we propose a training method where we successively freeze lower layers during training, only updating higher and higher layers, saving all computation needed for deriving gradients and updating in lower layers.

We apply this method to convolutional and residual networks trained on CIFAR-10, Figure 5, using a linear freezing regime: in the convolutional network, each layer is frozen at a fraction (layer number/total layers) of total training time, while for resnets, each residual block is frozen at a fraction (block number/total blocks). The vertical grey dotted lines show which steps have another set of layers frozen. Aside from saving computation, Freeze Training appears to actively help generalization accuracy, like early stopping but with different layers requiring different stopping points.
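The linear freezing regime can be written down as a short schedule computation. The helper names below are hypothetical, and the sketch only computes which layers are still being updated; wiring this into a training loop depends on the framework used.

```python
def linear_freeze_steps(num_layers, total_steps):
    """Step at which each layer (counted from the bottom, starting at 1) is frozen."""
    # Layer number / total layers, as a fraction of the total training time.
    return [round(total_steps * layer / num_layers)
            for layer in range(1, num_layers + 1)]

def trainable_layers(step, freeze_steps):
    """Zero-based indices of layers that are still updated at the given step."""
    return [i for i, freeze_at in enumerate(freeze_steps) if step < freeze_at]

freeze_steps = linear_freeze_steps(num_layers=4, total_steps=1000)
still_training = trainable_layers(300, freeze_steps)
```

With four layers and 1000 steps, the bottom layer is frozen at step 250, so at step 300 only the upper three layers still receive gradient updates.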

4.3 Interpreting Representations: when are classes learned?

We can also use SVCCA to compare how correlated representations in each layer are with the logits of each class in order to measure how knowledge about the target evolves throughout the network. In Figure 6 we apply the DFT CCA technique on the Imagenet Resnet resnet2015he . We take five different classes and, for different layers in the network, compute the DFT CCA similarity between the logit of that class and the network layer. The results successfully reflect semantic aspects of the classes: the firetruck class sensitivity line is clearly distinct from the two pairs of dog breeds, and the network develops greater sensitivity to firetruck earlier on. The two pairs of dog breeds, purposefully chosen so that each pair is similar to the other in appearance, have CCA similarity lines that are very close to each other through the network, indicating these classes are similar to each other.

4.4 Other Applications: Cross Model Comparison and compression

SVCCA similarity can also be used to compare the similarity of representations across different random initializations, and even different architectures. We compare convolutional networks on CIFAR-10 across random initializations (Appendix) and also a convolutional network to a residual network in Figure 7, using the DFT method described in Section 3.

In Figure 3, we saw that projecting onto the subspace of the top few SVCCA directions resulted in comparable accuracy. This observation motivates an approach to model compression. In particular, letting the output vector of layer $l$ be $x^{(l)} \in \mathbb{R}^{n \times 1}$, and the weights $W^{(l)}$, we replace the usual $W^{(l)} x^{(l)}$ with $(W^{(l)} P_x^T)(P_x x^{(l)})$, where $P_x$ is a $k \times n$ projection matrix, projecting $x^{(l)}$ onto the top SVCCA directions. This bottleneck reduces both parameter count and inference computational cost for the layer by a factor of roughly $k/n$. In Figure App.5 in the Appendix, we show that we can consecutively compress top layers with SVCCA by a significant amount (in one case reducing each layer to a fraction of its original size) and hardly affect performance.
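The projection-based bottleneck can be illustrated on synthetic data. In this sketch the top left singular vectors of the activations stand in for the top SVCCA directions (an assumption made to keep the example short), and the activations are constructed to lie exactly in a low-dimensional subspace, so the compressed layer reproduces the original output.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, out_dim, m = 64, 8, 10, 200

# Synthetic layer outputs (one column per datapoint) of rank k.
X = rng.normal(size=(n, k)) @ rng.normal(size=(k, m))
W = rng.normal(size=(out_dim, n))          # next layer's weights

# Stand-in for the top SVCCA directions: top-k left singular vectors of X.
U, _, _ = np.linalg.svd(X, full_matrices=False)
P = U[:, :k].T                             # (k, n) projection with orthonormal rows

W_small = W @ P.T                          # (out_dim, k): the compressed weights
compressed = W_small @ (P @ X)             # bottlenecked computation
full = W @ X                               # original computation
```

Here `W_small` has `out_dim * k` entries instead of `out_dim * n`. On real activations the agreement is only approximate and depends on how much of the representation the chosen directions capture.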

5 Conclusion

In this paper we present SVCCA, a general method which allows for comparison of the learned distributed representations between different neural network layers and architectures. Using SVCCA we obtain novel insights into the learning dynamics and learned representations of common neural network architectures. These insights motivated a new Freeze Training technique which can reduce the number of flops required to train networks and potentially even increase generalization performance. We observe that CCA similarity can be a helpful tool for interpretability, with sensitivity to different classes reflecting their semantic properties. This technique also motivates a new algorithm for model compression. Finally, the “lower layers learn first” behavior was also observed for recurrent neural networks as shown in Figure App.6 in the Appendix.

Appendix A Mathematical details of CCA and SVCCA

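Canonical Correlation of X, Y

Finding maximal correlations between $X, Y$ can be expressed as finding $a, b$ to maximise:

$$\frac{a^T \Sigma_{XY} b}{\sqrt{a^T \Sigma_{XX} a}\,\sqrt{b^T \Sigma_{YY} b}}$$

where $\Sigma_{XX}, \Sigma_{XY}, \Sigma_{YX}, \Sigma_{YY}$ are the covariance and cross-covariance terms. By performing the change of basis $\tilde{x}_1 = \Sigma_{XX}^{1/2} a$ and $\tilde{y}_1 = \Sigma_{YY}^{1/2} b$ and using Cauchy-Schwarz we recover an eigenvalue problem:

$$\tilde{x}_1 = \operatorname{argmax} \left[ \frac{x^T \Sigma_{XX}^{-1/2} \Sigma_{XY} \Sigma_{YY}^{-1} \Sigma_{YX} \Sigma_{XX}^{-1/2} x}{\|x\|} \right] \qquad (*)$$

SVCCA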

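Given two subspaces $X = \{x_1, \dots, x_{m_1}\}$, $Y = \{y_1, \dots, y_{m_2}\}$, SVCCA first performs a singular value decomposition on $X, Y$. This results in singular vectors $\{x'_1, \dots, x'_{m_1}\}$ with associated singular values $\{\lambda_1, \dots, \lambda_{m_1}\}$ (for $X$, and similarly for $Y$). Of these $m_1$ singular vectors, we keep the top $m'_1$, where $m'_1$ is the smallest value such that $\sum_{i=1}^{m'_1} |\lambda_i| \ge 0.99 \sum_{i=1}^{m_1} |\lambda_i|$. That is, 99% of the variation of $X$ is explainable by the top $m'_1$ vectors. This helps remove directions/neurons that are constant zero, or noise with small magnitude.

Then, we apply Canonical Correlation Analysis (CCA) to the sets of top singular vectors.

CCA is a well established statistical method for understanding the similarity of two different sets of random variables. Given our two sets of vectors $\{x'_1, \dots, x'_{m'_1}\}$, $\{y'_1, \dots, y'_{m'_2}\}$, we wish to find linear transformations $W_X, W_Y$ that maximally correlate the subspaces. This can be reduced to an eigenvalue problem. Solving this results in linearly transformed subspaces $\tilde{X}, \tilde{Y}$ with directions $\tilde{x}_i, \tilde{y}_i$ that are maximally correlated with each other, and orthogonal to $\tilde{x}_j, \tilde{y}_j$ for $j < i$. We let $\rho_i = corr(\tilde{x}_i, \tilde{y}_i)$. In summary, we have:

SVCCA Summary
1. Input: $X, Y$

2. Perform: SVD(X), SVD(Y). Output: $X' = UX$, $Y' = VY$

3. Perform CCA($X'$, $Y'$). Output: $\tilde{X} = W_X X'$, $\tilde{Y} = W_Y Y'$, and $corr = \{\rho_1, \dots, \rho_{\min(m'_1, m'_2)}\}$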
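The summary above can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the released implementation: the variance threshold is applied to the cumulative sum of singular values, and the CCA step uses the standard QR-based formulation (canonical correlations as singular values of the product of orthonormal bases) rather than the eigenvalue problem written above. All function names are hypothetical.

```python
import numpy as np

def svd_reduce(acts, var_kept=0.99):
    """Keep the top singular directions of centered activations (neurons x datapoints)."""
    acts = acts - acts.mean(axis=1, keepdims=True)
    _, s, vt = np.linalg.svd(acts, full_matrices=False)
    # Smallest k whose singular values account for var_kept of the total.
    k = int(np.searchsorted(np.cumsum(s) / np.sum(s), var_kept) + 1)
    k = min(k, len(s))
    return s[:k, None] * vt[:k]            # (k, datapoints)

def cca_correlations(x, y):
    """Canonical correlations between the row spaces of x and y (both centered)."""
    qx, _ = np.linalg.qr(x.T)              # orthonormal basis, (datapoints, kx)
    qy, _ = np.linalg.qr(y.T)
    rho = np.linalg.svd(qx.T @ qy, compute_uv=False)
    return np.clip(rho, 0.0, 1.0)

def svcca(acts1, acts2, var_kept=0.99):
    """Return the correlation coefficients and their mean (the SVCCA similarity)."""
    rho = cca_correlations(svd_reduce(acts1, var_kept), svd_reduce(acts2, var_kept))
    return rho, float(rho.mean())

# Usage on synthetic data: a rotated copy of a layer is perfectly aligned with it.
rng = np.random.default_rng(0)
layer = rng.normal(size=(6, 300))
rotation, _ = np.linalg.qr(rng.normal(size=(6, 6)))
rho_same, sim_same = svcca(layer, rotation @ layer)
rho_rand, sim_rand = svcca(layer, rng.normal(size=(4, 300)))
```

The rotated copy spans the same subspace as the original layer, so all of its correlation coefficients are close to 1, while an unrelated random layer gives a much lower similarity.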

Appendix B Additional Proofs and Figures from Section 2.1

Proof of Orthonormal and Scaling Invariance of CCA:

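We can see this using equation (*) as follows: suppose $U, V$ are orthonormal transforms applied to the sets $X, Y$. Then it follows that $\Sigma_{XX}^{a}$ becomes $U \Sigma_{XX}^{a} U^T$, for $a \in \{1, -1, 1/2, -1/2\}$, and similarly for $Y$ and $V$. Also note $\Sigma_{XY}$ becomes $U \Sigma_{XY} V^T$. Equation (*) then becomes

$$\tilde{x}_1 = \operatorname{argmax} \left[ \frac{x^T U \Sigma_{XX}^{-1/2} \Sigma_{XY} \Sigma_{YY}^{-1} \Sigma_{YX} \Sigma_{XX}^{-1/2} U^T x}{\|x\|} \right]$$

So if $\tilde{u}$ is a solution to equation (*), then $U \tilde{u}$ is a solution to the equation above, which results in the same correlation coefficients.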

B.0.1 The importance of SVD: how many directions matter?

While CCA is excellent at identifying useful learned directions that correlate, independent of certain common transforms, it doesn’t capture the full picture entirely. Consider the following setting: suppose we have subspaces $A, B, C$, with $A$ being low-dimensional, $B$ being higher-dimensional, with some of its directions perfectly aligned with $A$ and the rest being noise, and $C$ also being higher-dimensional, with the same number of directions aligned with $A$ (and $B$) and the rest being useful, but different, directions.

Then looking at the canonical correlation coefficients of $(A, B)$ and $(A, C)$ will give the same result, both being $1$ for the aligned directions and $0$ for everything else. But these are two very different cases: the subspace $B$ is indeed well represented by the directions that are aligned with $A$, whereas the subspace $C$ has more useful directions.

This distinction becomes particularly important when aggregating canonical correlation coefficients as a measure of similarity, as used in analysing network learning dynamics. However, by first applying SVD to determine the number of directions needed to explain 99% of the observed variance, we can distinguish between pathological cases like the one above.
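The pathological case can be reproduced on synthetic data. The dimensions below (5 shared directions, 15 extra directions, 500 datapoints) and the noise scale are illustrative assumptions, and the CCA step uses the standard QR-based formulation.

```python
import numpy as np

def cca_corr(x, y):
    """Canonical correlations between two sets of neuron vectors (rows)."""
    x = x - x.mean(axis=1, keepdims=True)
    y = y - y.mean(axis=1, keepdims=True)
    qx, _ = np.linalg.qr(x.T)
    qy, _ = np.linalg.qr(y.T)
    return np.linalg.svd(qx.T @ qy, compute_uv=False)

def num_directions(acts, var_kept=0.99):
    """Number of singular directions needed to reach the variance threshold."""
    acts = acts - acts.mean(axis=1, keepdims=True)
    s = np.linalg.svd(acts, compute_uv=False)
    return int(np.searchsorted(np.cumsum(s) / s.sum(), var_kept) + 1)

rng = np.random.default_rng(0)
m = 500
A = rng.normal(size=(5, m))
B = np.vstack([A, 1e-3 * rng.normal(size=(15, m))])   # A plus low-magnitude noise
C = np.vstack([A, rng.normal(size=(15, m))])          # A plus other useful directions

rho_ab = cca_corr(A, B)    # all close to 1
rho_ac = cca_corr(A, C)    # also all close to 1
dims = {"A": num_directions(A), "B": num_directions(B), "C": num_directions(C)}
```

CCA alone cannot tell `B` and `C` apart relative to `A`, but the SVD step reports that `B` needs only as many directions as `A`, while `C` needs all of its directions.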

Appendix C Proof of Theorem 1

Here we provide the proofs for Lemma 1, Lemma 2, Theorem 2 and finally Theorem 1.

A preliminary note before we begin:

When we consider a (wlog) $n$ by $n$ channel $c$ of a convolutional layer, we assume it has shape

$$\begin{bmatrix} z_{0,0} & z_{0,1} & \dots & z_{0,n-1} \\ z_{1,0} & z_{1,1} & \dots & z_{1,n-1} \\ \vdots & \vdots & \ddots & \vdots \\ z_{n-1,0} & z_{n-1,1} & \dots & z_{n-1,n-1} \end{bmatrix}$$

When computing the covariance matrix however, we vectorize $c$ by stacking the columns under each other, and call the result $\mathrm{vec}(c)$:

$$\mathrm{vec}(c) = \begin{bmatrix} z_{0,0} \\ z_{1,0} \\ \vdots \\ z_{n-1,0} \\ z_{0,1} \\ \vdots \\ z_{n-1,n-1} \end{bmatrix} := \begin{bmatrix} z_0 \\ z_1 \\ \vdots \\ z_{n-1} \\ z_n \\ \vdots \\ z_{n^2-1} \end{bmatrix}$$

One useful identity when switching between these two notations (see e.g. [8]) is

$$\mathrm{vec}(AcB) = (B^T \otimes A)\,\mathrm{vec}(c)$$

where $A, B$ are matrices and $\otimes$ is the Kronecker product. A useful observation arising from this is:

Lemma 3.

The CCA vectors of $DFT(c_i), DFT(c_j)$ are the same (up to a rotation by $F$) as the CCA vectors of $c_i, c_j$.

Proof: From Section B we know that unitary transforms only rotate the CCA directions. But while the DFT pre- and postmultiplies by $F$ and $F^T$, which are unitary matrices, we cannot directly apply this, as the result is for unitary transforms on $\mathrm{vec}(c_i)$. But, using the identity above, we see that $\mathrm{vec}(DFT(c_i)) = \mathrm{vec}(F c_i F^T) = (F \otimes F)\,\mathrm{vec}(c_i)$, and $F \otimes F$ is unitary as $F$ is unitary. Applying the same identity to $c_j$, we can thus conclude that the DFT preserves CCA (up to rotations).

As Theorem 1 preprocesses the neurons with DFT, it is important to note that by the Lemma above, we do not change the CCA vectors (except by a rotation).
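Both facts used in the proof of Lemma 3, the vectorization identity and the unitarity of the Kronecker product of the DFT matrix with itself, can be checked numerically. The matrix size and random seed are arbitrary; `vec` stacks columns, matching the convention above.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
A, c, B = (rng.normal(size=(n, n)) for _ in range(3))

def vec(M):
    """Stack the columns of M under each other."""
    return M.reshape(-1, order="F")

# Identity: vec(A c B) = (B^T kron A) vec(c).
lhs = vec(A @ c @ B)
rhs = np.kron(B.T, A) @ vec(c)

# The 2D DFT of a channel acts on vec(c) as the unitary matrix F kron F.
F = np.fft.fft(np.eye(n)) / np.sqrt(n)   # unitary, symmetric DFT matrix
K = np.kron(F, F)
dft_vec = vec(F @ c @ F.T)
```

Since `K` is unitary, applying the DFT to a channel is a rotation of its vectorized neurons, which is why the CCA coefficients are unchanged.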

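C.1 Proof of Lemma 1

Proof.

Translation invariance is preserved: We show inductively that any translation invariant input to a convolutional channel results in a translation invariant output. Suppose the input $X$ to channel $c$ ($n$ by $n$) is translation invariant. It is sufficient to show that for any input $X_i$ and any shift $(a, b)$ with $0 \le a, b \le n-1$, there is an input $X_j$ such that $c(X_j)$ is the $(a, b)$ shift of $c(X_i)$. But an $(a, b)$ shift in neuron coordinates in $c$ corresponds to a (height stride $\cdot\, a$, width stride $\cdot\, b$) shift in the input. And as $X$ is translation invariant, it contains some $X_j$ that is exactly this shift of $X_i$.

$cov(c)$ is circulant:

Let $X$ be (by the proof above) a translation invariant input to a channel $c$ in some convolution or pooling layer. The empirical covariance, $cov(c)$, is the $n^2$ by $n^2$ matrix computed by (assuming $c$ is centered)

$$\frac{1}{|X|} \sum_{X_i \in X} \mathrm{vec}(c(X_i)) \cdot \mathrm{vec}(c(X_i))^T$$

So, $cov(c)_{ij} = \frac{1}{|X|} z_i^T z_j$, i.e. the inner products of the neurons $i$ and $j$.

The indexes $i$ and $j$ refer to the neurons in their vectorized order in $\mathrm{vec}(c)$. But in the matrix ordering of neurons in $c$, $i$ and $j$ correspond to some $(a_1, b_1)$ and $(a_2, b_2)$. If we applied a translation $(a, b)$ to both, we would get new neuron coordinates $(a_1 + a, b_1 + b)$, $(a_2 + a, b_2 + b)$ (all coordinates $\bmod\ n$), which would correspond to $i + an + b \bmod n^2$ and $j + an + b \bmod n^2$, by our stacking of columns and reindexing.

Let $\tau$ be the translation in inputs corresponding to an $(a, b)$ translation in $c$. Then the response of neuron $i$ on an input $X_k$ equals the response of neuron $i + an + b \bmod n^2$ on $\tau(X_k)$, and similarly for $j$. Since $X$ is translation invariant, $\tau$ only permutes the dataset, so inner products taken over $X$ are unchanged.

It follows that $cov(c)_{ij} = cov(c)_{(i+an+b)(j+an+b)}$, or, with indexing $\bmod\ n^2$,

$$\frac{1}{|X|} z_i^T z_j = \frac{1}{|X|} z^T_{(i+an+b \bmod n^2)}\, z_{(j+an+b \bmod n^2)}$$

This gives us the circulant structure of $cov(c)$.

$cov(c)$ is block circulant: Let $z^{(i)}$ be the $i$th column of $c$, and $z^{(j)}$ the $j$th. In $\mathrm{vec}(c)$, these correspond to two contiguous runs of $n$ entries each, and the $n$ by $n$ submatrix at those row and column indexes of $cov(c)$ corresponds to the covariance of columns $i, j$. But then we see that the covariance of columns $i+k, j+k$ corresponds to the covariance of the same neurons after a 2-d shift by $k$ columns, applied to every neuron. So by an identical argument to above, we see that for all $k$

$$cov(z^{(i)}, z^{(j)}) = cov(z^{(i+k)}, z^{(j+k)})$$

In particular, $cov(c)$ is block circulant. ∎

An example with $c$ being 3 by 3 looks like below:

$$\begin{bmatrix} A_0 & A_1 & A_2 \\ A_2 & A_0 & A_1 \\ A_1 & A_2 & A_0 \end{bmatrix}$$

where each $A_i$ is itself a circulant matrix.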

C.2 Proof of Lemma 2

Proof.

This is a standard result, following from expressing a circulant matrix $A$ in terms of its diagonal form, i.e. $A = V \Sigma V^{-1}$ with the columns of $V$ being its eigenvectors. Noting that $V = F$, the DFT matrix, and that vectors of powers of distinct roots of unity $\omega_k = \exp(2\pi i k / n)$, $\omega_j = \exp(2\pi i j / n)$ are orthogonal gives the result. ∎

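C.3 Proof of Theorem 2

Proof.

Starting with (a), we need to show that $cov(\mathrm{vec}(DFT(c_i)), \mathrm{vec}(DFT(c_i)))$ is diagonal. But by the identity above, this becomes:

$$cov(\mathrm{vec}(DFT(c_i)), \mathrm{vec}(DFT(c_i))) = (F \otimes F)\,\mathrm{vec}(c_i)\,\mathrm{vec}(c_i)^T (F \otimes F)^*$$

By Lemma 1, we see that

$$cov(\mathrm{vec}(c_i)) = \mathrm{vec}(c_i)\,\mathrm{vec}(c_i)^T = \begin{bmatrix} A_0 & A_1 & \dots & A_{n-1} \\ A_{n-1} & A_0 & \dots & A_{n-2} \\ \vdots & \vdots & \ddots & \vdots \\ A_1 & A_2 & \dots & A_0 \end{bmatrix}$$

with each $A_k$ circulant.

And so $cov(\mathrm{vec}(DFT(c_i)), \mathrm{vec}(DFT(c_i)))$ becomes

$$\begin{bmatrix} f_{00}F & f_{01}F & \dots & f_{0,n-1}F \\ f_{10}F & f_{11}F & \dots & f_{1,n-1}F \\ \vdots & \vdots & \ddots & \vdots \\ f_{n-1,0}F & f_{n-1,1}F & \dots & f_{n-1,n-1}F \end{bmatrix} \begin{bmatrix} A_0 & A_1 & \dots & A_{n-1} \\ A_{n-1} & A_0 & \dots & A_{n-2} \\ \vdots & \vdots & \ddots & \vdots \\ A_1 & A_2 & \dots & A_0 \end{bmatrix} \begin{bmatrix} f^*_{00}F^* & f^*_{10}F^* & \dots & f^*_{n-1,0}F^* \\ f^*_{01}F^* & f^*_{11}F^* & \dots & f^*_{n-1,1}F^* \\ \vdots & \vdots & \ddots & \vdots \\ f^*_{0,n-1}F^* & f^*_{1,n-1}F^* & \dots & f^*_{n-1,n-1}F^* \end{bmatrix}$$

From this, we see that the $(s, j)$th entry has the form

$$\sum_{l=0}^{n-1} \left( \sum_{k=0}^{n-1} f_{sk} F A_{l-k} \right) f^*_{lj} F^* = \sum_{k,l} f_{sk} f^*_{lj}\, F A_{l-k} F^*$$

Letting $[F A_r F^*]$ denote the coefficient of the term $F A_r F^*$, we see that (addition being $\bmod\ n$)

$$[F A_r F^*] = \sum_{k=0}^{n-1} f_{sk} f^*_{(k+r)j} = \sum_{k} e^{\frac{2\pi i s k}{n}} \cdot e^{-\frac{2\pi i j (k+r)}{n}} = e^{-\frac{2\pi i j r}{n}} \sum_{k=0}^{n-1} e^{\frac{2\pi i k (s-j)}{n}} = e^{-\frac{2\pi i j r}{n}} \cdot \delta_{sj}$$

with the last step following from the fact that the sum of powers of a non-trivial root of unity is $0$.

In particular, we see that only the diagonal entries (of the $n$ by $n$ matrix of matrices) are non-zero. The diagonal elements are linear combinations of terms of the form $F A_r F^*$, and by Lemma 2 these are diagonal. So the covariance of the DFT is diagonal as desired.

Part (b) follows almost identically to part (a), but by first noting that, exactly by the proof of Lemma 1, $cov(c_i, c_j)$ is also a circulant and block circulant matrix.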

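C.4 Proof of Theorem 1

Proof.

This Theorem now follows easily from the previous. Suppose we have a layer $l$, with channels $c_1, \dots, c_k$. And let $\mathrm{vec}(DFT(c_i))$ have directions $\tilde{z}^{(i)}_0, \dots, \tilde{z}^{(i)}_{n^2-1}$. By the previous theorem, we know that the covariance of all of these neurons only has non-zero terms $cov(\tilde{z}^{(i)}_r, \tilde{z}^{(j)}_r)$, i.e. between directions with the same index $r$.

So arranging the full covariance matrix to have row and column indexes $\tilde{z}^{(1)}_0, \dots, \tilde{z}^{(k)}_0, \tilde{z}^{(1)}_1, \dots, \tilde{z}^{(k)}_{n^2-1}$, the nonzero terms all live in the $k$ by $k$ blocks down the diagonal of the matrix, proving the theorem. ∎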

C.5 Computational Gains

As the covariance matrix is block diagonal, our more efficient algorithm for computation is as follows: take the DFT of every channel (fast, due to the FFT) and then compute covariances according to blocks: partition the directions into the non-zero diagonal blocks, each of size (number of channels) by (number of channels), and compute the covariance, inverses and square roots along these.

The cost of the covariance computation is therefore polynomially smaller than that of the naive computation over the full matrix. Furthermore, the DFT method also makes for easy parallelization, as each of the blocks does not interact with any of the others.

Appendix E Additional Figure from Section 4.4

Figure App.4 compares the converged representations of two different initializations of the same convolutional network on CIFAR-10.