# A Provably Correct Algorithm for Deep Learning that Actually Works

We describe a layer-by-layer algorithm for training deep convolutional networks, where each step involves gradient updates for a two-layer network followed by a simple clustering algorithm. Our algorithm stems from a deep generative model that generates images level by level, where lower resolution images correspond to latent semantic classes. We analyze the convergence rate of our algorithm assuming that the data is indeed generated according to this model (as well as additional assumptions). While we do not pretend to claim that the assumptions are realistic for natural images, we do believe that they capture some true properties of real data. Furthermore, we show that our algorithm actually works in practice (on the CIFAR dataset), achieving results in the same ballpark as those of vanilla convolutional neural networks trained by stochastic gradient descent. Finally, our proof techniques may be of independent interest.


## 1 Introduction

The success of deep convolutional neural networks (CNNs) has sparked many works trying to understand their behavior. We can roughly separate these works into three categories. First, the majority of the works focus on providing various optimization methods and algorithms that work well in practice, but have almost no theoretical guarantees. A second class of works focuses on analyzing practical algorithms (mostly SGD), but under strong assumptions on the data distribution, like linear separability or sampling from a Gaussian distribution, that often make these problems trivially solvable by much simpler algorithms. A third class of works makes less restrictive assumptions on the data and provides strong theoretical guarantees, but these guarantees hold for algorithms that don't really work in practice.

In this work, we study a new algorithm for learning deep convolutional networks, assuming the data is generated from some deep generative model. This model assumes that the examples are generated in a hierarchical manner: each example (image) is generated by first drawing a high-level semantic image, and iteratively refining it, each time generating a lower-level image based on the higher-level semantics from the previous step. Similar models were suggested in other works as good descriptions of natural images encountered in real-world data. These works, although providing important insights, suffer from one of two major shortcomings: they either suggest algorithms that seem promising for practical use, but without any theoretical guarantees, or otherwise provide algorithms with sound theoretical analysis that seem far from being applicable to learning real-world data.

Our work achieves promising results in the following sense: first, we show an algorithm along with a complete theoretical analysis, proving its convergence under the assumed generative model (as well as additional, admittedly strong, assumptions). Second, we show that implementing the algorithm to learn real-world data achieves performance that is in the same ballpark as the popular CNN trained with SGD-based optimizers. Third, the problem on which we apply our algorithm is not trivially learned by simple "shallow" learning algorithms. The main achievement of this paper is succeeding in all of these goals together. As is usually the case in tackling hard problems, our theoretical analysis makes strong assumptions on the data distribution, and we clearly state them in our analysis. Nevertheless, the resulting algorithm works on real data (where the assumptions clearly do not hold). That said, we do not wish to claim that such an algorithm achieves state-of-the-art results, and hence did not apply many of the common "tricks" that are used in practice to train a CNN, but rather compared our algorithm to an "out-of-the-box" SGD-based optimization.

## 2 Related Work

As mentioned, we can roughly divide the works relevant to the scope of this paper into three categories: (1) works that study practical algorithms (SGD) solving “simple” problems that can be otherwise learned with “shallow” algorithms. (2) works that study problems with less restrictive assumptions, but using algorithms that are not applicable in practice. (3) works that study a generative model similar to ours, but either give no theoretical guarantees, or otherwise analyze an algorithm that is “tailored” to learning the generative model, and seems very far from algorithms used in practice.

Trying to study a practically useful algorithm, [5] proves that SGD learns a function that approximates the best function in the conjugate kernel space derived from the network architecture. Although this work provides guarantees for a wide range of deep architectures, there is no empirical evidence that the best function in the conjugate kernel space performs in the same ballpark as CNNs. The work of [1] shows guarantees on learning low-degree polynomials, which is again learnable via SVM or a direct feature mapping. Other works study shallow (one-hidden-layer) networks under significant assumptions. The works of [8, 4] study the convergence of SGD trained on linearly separable data, which could be learned with the Perceptron algorithm, and the works of [3, 15, 10, 20] assume that the data is generated from a Gaussian distribution, an assumption that clearly does not hold for real-world data. The work of [6] extends the results of [3], showing recovery of convolutional kernels without assuming a Gaussian distribution, but is still limited to the regime of shallow two-layer networks.

Another line of work aims to analyze the learning of deep architectures, in cases that exceed the capacity of shallow learning. The works of [11, 18, 17] show polynomial-time algorithms aimed at learning deep models, but that seem far from performing well in practice. The work of [19] analyzes a method of learning a model similar to a CNN which can be applied to learn multi-layer networks, but the analysis is limited to shallow two-layer settings, where the formulated problem is convex.

Finally, there have been a few works suggesting distributional assumptions on the data that are similar in spirit to the generative model that we analyze in this paper. Again, these works can be largely categorized into two classes: works that provide algorithms with theoretical guarantees but no practical success, and works that show practical results without theoretical guarantees. The work of [2] shows a provably efficient algorithm for learning a deep representation, but this algorithm seems far from capturing the behavior of algorithms used in practice. Our approach can be seen as an extension of the work of [12], who studies Hierarchical Generative Models, focusing on algorithms and models that are applicable to biological data. [12] suggests that similar models may be used to define image refinement processes, and our work shows that this is indeed the case, while providing both theoretical proofs and empirical evidence for this claim. Finally, the works of [14, 13, 16] study generative models similar to ours, with promising empirical results when implementing EM-inspired algorithms, but giving no theoretical foundations whatsoever.

## 3 Generative Model

We begin by introducing our generative model. This model is based on the assumption that the data is generated in a hierarchical manner. For each label, we first generate a high-level semantic representation, which is simply a small-scale image where each "pixel" represents a semantic class (in the case of natural images, these classes could be: background, sky, grass, etc.). From this semantic image, we generate a lower-level image, where each patch comes from a distribution depending on the corresponding "pixel" of the high-level representation, generating a larger semantic image (lower-level semantic classes for natural images could be: edges, corners, texture, etc.). We can repeat this process iteratively any number of times, each time creating a larger image of lower-level semantic classes. Finally, to generate a greyscale or RGB image, we assume that the last iteration of this process samples real-valued patches. This model is described schematically in Figure 1, with a formal description given in Section 3.1. Section 3.2 describes a synthetic example of digit images generated according to this model.

### 3.1 Formal Description

To generate an example, we start by sampling the label $y$ uniformly from the set of labels. Given $y$, we generate a small image whose pixels each belong to a finite set of high-level semantic classes; the elements of this set correspond to semantic entities (e.g. "sky", "grass", etc.). The generated image is sampled according to some simple distribution (to be defined later). Next, we generate a new, larger image as follows. Each pixel of the current image takes some semantic class as its value, and every such class induces a distribution over patches of a fixed "patch size" in the next level; each pixel thus generates a patch in the new image by sampling according to the distribution of its class. This process continues, yielding a sequence of growing images, where each pixel of level $t$ takes values in the class set of that level, and each level induces a distribution over images of that level. We assume the classes of the final level are real numbers, hence the final image is over the reals. The resulting example is the pair of the final image and the label.

### 3.2 Synthetic Digits Example

To demonstrate our generative model, we use a small synthetic example to generate images of digits. In this case, we use a three-level model, where semantic classes represent lines, corners, etc. In the notation above, we use:

$C_0$ = {nine high-level stroke/shape symbols}, $C_1$ = {six low-level line-segment symbols}, $C_2 = \mathbb{R}$ (the symbols of $C_0$ and $C_1$ are rendered graphically in the original).

We define the top-level distributions to be concentrated on the equivalent digital representation of each of the ten digits (one template image per digit, rendered graphically in the original).

Now, in the second level of the generative model, each pixel can generate one of four possible manifestations: for each high-level symbol, we sample uniformly over four lower-level renderings of it. Similarly, in the final level we sample, for each low-level symbol, from a distribution supported on 4 elements (four real-valued patch renderings).

Notice that, though this example is extremely simplistic, it can generate a combinatorially large number of distinct examples per digit at the first level, and exponentially many more per digit at the final layer. Figure 2 shows the output of this process.
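A minimal sketch of such a hierarchical sampler (the class sets, manifestation dictionaries, and two-level depth below are toy stand-ins of ours, not the paper's digit distributions):

```python
import random

# Each semantic class at level k maps to a few possible "manifestations":
# the patches (tuples of lower-level classes) it can generate at level k+1.
# These dictionaries are toy placeholders.
MANIFESTATIONS = {
    0: {"A": [("a", "b"), ("b", "a")], "B": [("b", "b"), ("a", "a")]},  # level 0 -> 1
    1: {"a": [(0.0, 1.0), (1.0, 0.0)], "b": [(1.0, 1.0), (0.5, 0.5)]},  # level 1 -> reals
}

def sample_example(label_to_top_image, num_levels=2, rng=random):
    """Draw a label, its top-level semantic image, then refine level by level."""
    y = rng.choice(sorted(label_to_top_image))
    image = list(label_to_top_image[y])           # top-level semantic "pixels"
    for level in range(num_levels):
        refined = []
        for pixel in image:                       # each pixel emits one patch
            refined.extend(rng.choice(MANIFESTATIONS[level][pixel]))
        image = refined
    return image, y

example, label = sample_example({+1: ("A", "B"), -1: ("B", "A")})
```

Each refinement doubles the image size here, so a 2-pixel top image yields 8 real values after two levels.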

## 4 Algorithm

Assume we are given data from the generative distribution described in Section 3. Our goal is to learn a classifier that predicts the label of each image. A natural approach would be to try to learn, for each low-level patch, the semantic class (in the higher-level semantic image) from which it was generated. This way, we could cluster together semantically related patches, exposing the higher-level semantic image that generated the lower-level image. If we succeed in doing so multiple times, we can infer the topmost semantic image in the hierarchy. Assuming the high-level distribution is simple enough (for example, linearly separable with respect to some embedding of the classes), we could then use a simple classification algorithm on the high-level image to infer its label.

Unfortunately, we cannot learn these semantic classes directly, as we are not given access to the latent semantic images, but only to the lowest-level image generated by the model. To learn these classes, we use a combination of a simple clustering algorithm and a gradient-descent based algorithm that learns a single layer of a convolutional neural network. Surprisingly, as we show in the theoretical section, gradient descent finds an embedding of the patches such that patches from the same class are close to each other, while patches from different classes are far away. The clustering step then clusters together patches from the same class.

### 4.1 Algorithm Description

The algorithm we suggest is built from three building blocks composed together to construct the full algorithm: (1) a clustering algorithm, (2) gradient-based optimization of a two-layer Conv net, and (3) a simple classification algorithm. In order to expose the latent representation of each layer in the generative model, we perform the following iteratively:
(1) Run a centroid-based clustering algorithm on the patches of the input image defined by the previous step (or the original image in the first step), w.r.t. the cosine distance, to get cluster centers.
(2) Run a convolution operation with the cluster centroids as kernels, followed by ReLU with a fixed bias and a pooling operation. This results in mapping the patches of the input images to (approximately) orthogonal vectors in an intermediate space.
(3) Initialize a 1x1 convolution operation that maps the intermediate channels into a new set of channels, followed by a linear layer that outputs the label predictions (its input is the tensor flattened into a vector). We train this two-layer subnet using a gradient-based optimization method. As we show in the analysis, this step implicitly learns an embedding of the patches into a space where patches from the same semantic class are close to each other, while patches from different classes are far away, hence laying the ground for the clustering step of the next iteration.
(4) Discard the last linear layer, thus leaving a trained block of Conv-ReLU-Pool-Conv which finds a "good" embedding of the patches of the input image, and repeat the process, where the output of this block is the input to step 1.

Finally, after performing this process once per level of the generative model, we get a deep network composed of Conv-ReLU-Pool-Conv blocks. Then, we feed the output of this (already trained) network to some classifier, training it to infer the label from the semantic representation that the convolutional network outputs. This training is again done using a gradient-based optimization algorithm. We now describe the building blocks of the algorithm, followed by the definition of the complete algorithm.
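The layer-by-layer procedure in steps (1)-(4) above can be sketched schematically as follows. The three building blocks are passed in as callables; their names and signatures (`cluster_patches`, `train_two_layer_subnet`, `train_classifier`) are hypothetical stand-ins of ours, not the paper's notation:

```python
# Schematic skeleton of the layer-by-layer training loop described above.

def train_layerwise(x, labels, num_levels,
                    cluster_patches, train_two_layer_subnet, train_classifier):
    blocks = []
    for _ in range(num_levels):
        # (1)+(2): cluster patches; centroids become a Conv-ReLU-Pool map
        conv_relu_pool = cluster_patches(x)
        h = conv_relu_pool(x)
        # (3): train a 1x1-Conv + linear head by gradient-based optimization
        conv1x1, _linear_head = train_two_layer_subnet(h, labels)
        # (4): discard the head, keep the Conv-ReLU-Pool-Conv block
        blocks.append((conv_relu_pool, conv1x1))
        x = conv1x1(h)  # input to the next level's clustering step
    # finally, train a simple classifier on the frozen network's output
    return blocks, train_classifier(x, labels)
```

The dependency injection keeps the skeleton agnostic to how each block is implemented.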

#### 4.1.1 Clustering

The first block of the algorithm is the clustering step. We let this be any polynomial time clustering algorithm that, given a sample, outputs a mapping from examples to vectors, satisfying that examples that are close to each other are mapped to the same vector, while examples that are far from each other are mapped to orthogonal vectors. Notice that this could be a trivial clustering algorithm: for each example, we cluster together all the examples within a small distance from it, mapping different clusters to orthogonal vectors whose dimension is the number of clusters found in the sample.

For consistency with common CNN architectures, we can use a centroid-based clustering algorithm that outputs the centroid of each cluster, using these centroids as kernels for a convolution operation. Combining this with ReLU with a fixed bias and a pooling operation gives an operation that maps each patch to a single vector, where vectors of different patches are approximately orthogonal.
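A minimal sketch of this Conv+ReLU realization of the clustering step; the centroids, patch dimension, and bias value below are toy assumptions of ours, chosen so that well-separated clusters yield approximately one-hot (orthogonal) codes:

```python
import numpy as np

# Toy centroids (k x patch_dim), normalized so dot products are cosines.
centroids = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
centroids /= np.linalg.norm(centroids, axis=1, keepdims=True)

def embed_patches(patches, bias=0.9):
    """Map each patch to a k-dim code: cosine similarity with each centroid,
    then ReLU with a fixed bias, so only the matching centroid fires."""
    p = patches / np.linalg.norm(patches, axis=1, keepdims=True)
    acts = p @ centroids.T               # the "convolution" on flattened patches
    return np.maximum(acts - bias, 0.0)  # ReLU with fixed bias

codes = embed_patches(np.array([[2.0, 0.0], [0.0, 3.0]]))
```

Each input patch fires only on its own centroid, so distinct clusters map to mutually orthogonal code vectors.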

#### 4.1.2 Two-Layer Network Algorithm

The second building block of our main algorithm is a gradient-based optimization algorithm that is used to train a two-layer convolutional subnet. In this paper, we define a convolutional subnet to be a function of the form:

$$N_{K,W}(X) = \langle W^\top, K^\top X \rangle$$

where we define the inner product between matrices as the Frobenius inner product, $\langle A, B \rangle = \sum_{i,j} A_{i,j} B_{i,j}$.

This is equivalent to a convolution operation on an image, followed by a linear weighted sum: assume $X$ is the matrix where each column is a patch of the image (the "im2col" operation); then multiplying this matrix by $K^\top$ is equivalent to performing a convolution operation on the original image whose kernels are the columns of $K$. Flattening the resulting matrix and multiplying by the weights in $W$ yields the second linear layer.
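This equivalence can be checked on a toy 1-d example; the sizes, signal, and kernels are our own illustrative choices, and `im2col_1d` is a helper we define here:

```python
import numpy as np

# Sketch: convolving a 1-d signal with kernels K equals K^T times the
# im2col matrix whose columns are the sliding patches.

def im2col_1d(x, s):
    """Stack the length-s sliding windows of x as columns (s x num_patches)."""
    return np.stack([x[i:i + s] for i in range(len(x) - s + 1)], axis=1)

x = np.arange(6, dtype=float)            # a toy 1-d "image": [0, 1, ..., 5]
K = np.array([[1.0, 0.0],                # each COLUMN of K is one kernel:
              [-1.0, 2.0]])              # kernel 0 = [1, -1], kernel 1 = [0, 2]
cols = im2col_1d(x, s=2)                 # patches as columns
conv = K.T @ cols                        # row j = kernel j slid over x
```

Row 0 computes `x[i] - x[i+1]` at every position and row 1 computes `2*x[i+1]`, exactly what sliding the two kernels over the signal would give.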

The top linear layer of the network outputs a prediction for the label $y$, and the network is trained with respect to the loss on a given set of examples $S$, defined as:

$$L^S_{K,W} = \mathbb{E}_{(X,y)\sim S}\left[\ell_y\left(N_{K,W}(X)\right)\right]$$

for some loss function $\ell$.

After removing the top linear layer (which is used only to train the convolutional layer), the algorithm outputs the matrix $K$: a set of 1x1 convolution kernels learned during the optimization, which are used on top of the previous operations. We parameterize the algorithm that trains such a two-layer network by its width, the training sample, the number of iterations, the learning rate, and the random initialization scale (described in detail in the theoretical section). The algorithm outputs the learned Conv1x1 kernels.

As we show in our theoretical analysis, running a gradient-based algorithm will implicitly learn an embedding that maps patches from the same class to similar vectors, and patches from different classes to vectors that are far away.
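A toy sketch of the training step that the analysis studies, with only $K$ updated and $W$ fixed. The linear loss $\ell_y(v) = -y \cdot v$ and all dimensions below are illustrative assumptions of ours, not the paper's exact setting:

```python
import numpy as np

# One gradient step on the two-layer subnet N_{K,W}(X) = <W^T, K^T X>.

rng = np.random.default_rng(0)
ell, m, k = 4, 3, 2                      # embedding dim, #patches, width (toy)
X = rng.standard_normal((ell, m))        # one embedded "image"
y = 1.0
K = rng.standard_normal((ell, k))        # trained first layer (conv kernels)
W = rng.standard_normal((m, k))          # fixed second layer

def net(K, W, X):
    """N_{K,W}(X) = <W^T, K^T X>, the entrywise inner product of matrices."""
    return np.sum(W.T * (K.T @ X))

# With l_y(v) = -y*v, the gradient w.r.t. K is -y * X @ W
# (since N = tr(W K^T X) is linear in K).
eta = 0.1
K_next = K - eta * (-y * X @ W)
```

Because the network output is linear in $K$, one descent step on this loss provably increases $y \cdot N_{K,W}(X)$.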

#### 4.1.3 Classification Algorithm

Finally, the last building block of the algorithm is a classification stage, used on top of the deep convolutional architecture learned in the previous steps. We consider some hypothesis space (for example, linear separators). Denote by CLS a polynomial time classification algorithm, such that given a sample, the algorithm outputs some hypothesis from this space. Again, we can assume this algorithm is trained using a gradient-based optimization algorithm, to infer the label based on the high-level semantics generated by the deep convolutional network trained in the previous steps.

#### 4.1.4 Complete Algorithm

Utilizing the building blocks described previously, our algorithm learns a deep CNN layer after layer. This network is used to infer the label of each image. The algorithm is described formally in Algorithm 1. In the description, we use the notation $\phi * A$ to denote the operation of applying a map $\phi$ on a tensor $A$, replacing patches of size $m_0$ by vectors. Formally:

$$\phi * A := \left[\phi\left(A_{:,\, i \cdot m_0 \ldots (i+1) \cdot m_0}\right)\right]_i$$
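The $\phi * A$ operation can be sketched as follows, with a toy map (column sums) standing in for the learned $\phi$, purely to show the patch indexing:

```python
import numpy as np

# Apply a patch map phi to each non-overlapping width-m0 slice of the
# columns of A, and stack the resulting vectors as columns.

def apply_patchwise(phi, A, m0):
    """[phi(A[:, i*m0:(i+1)*m0]) for each patch i], stacked as columns."""
    patches = [A[:, i * m0:(i + 1) * m0] for i in range(A.shape[1] // m0)]
    return np.stack([phi(p) for p in patches], axis=1)

A = np.arange(12, dtype=float).reshape(2, 6)
out = apply_patchwise(lambda p: p.sum(axis=0), A, m0=2)
```

Here a 2x6 tensor with patch width 2 yields three patch outputs, stacked into a 2x3 result.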

## 5 Theoretical Analysis

In this section we prove that, under some assumptions, the algorithm described in Algorithm 1 learns (with high probability) a network model that correctly classifies the examples according to their labels. The structure of this section is as follows. We first introduce our assumptions on the data distribution as well as on the specific implementation of the algorithm. Next, we turn to the analysis of the algorithm itself, starting by showing that the sub-module that trains a two-layer network implicitly learns an embedding of the patches into a space where patches from the same semantic class are close to each other, while patches from different classes are far apart. Using this property, we show that even a trivial clustering algorithm manages to correctly cluster the patches. Finally, we prove that performing these two steps (two-layer network + trivial clustering) iteratively, layer by layer, reveals the underlying model.

### 5.1 Assumptions

Our analysis relies on several assumptions on the data distribution, as well as on the suggested implementation of the algorithm. These assumptions are necessary for the theorems to hold, and admittedly are far from trivial. We believe that some of the assumptions can be relaxed at the expense of a much more complicated proof.

#### 5.1.1 Distributional Assumptions

For simplicity, we focus on binary classification problems, namely labels in $\{\pm 1\}$. The extension to multi-class problems is straightforward. We assume that the sets of semantic classes are finite, and that the final (observed) image is over the reals.

We assume the following holds at every level. For every semantic class, the distribution of the patches it generates in the lower-level image is uniform over a finite set of patches. We further assume that all these sets are disjoint and of a fixed size.

For every semantic class we denote an operator that takes a tensor (of some dimension) as its input and replaces every element of the input by the boolean indicating whether it equals that class.

For every level $\kappa$ and semantic class $c$ we introduce the "mean" image $v^\kappa_c$: the average, over the data distribution, of the label-weighted indicator of where class $c$ appears in the semantic image. For example, semantic classes that tend to appear in the upper-left corner of the image for positive examples will have positive values in the upper-left entries of $v^\kappa_c$. As will be explained later, these images play a key role in our analysis.

For our analysis to follow, we assume that the vectors $v^\kappa_c$ are linearly independent. For each pair of classes $c_1, c_2$ we denote the angle between $v^\kappa_{c_1}$ and $v^\kappa_{c_2}$ by:

$$\angle\left(v^\kappa_{c_1}, v^\kappa_{c_2}\right) := \arccos\left(\frac{v^\kappa_{c_1} \cdot v^\kappa_{c_2}}{\lVert v^\kappa_{c_1}\rVert \, \lVert v^\kappa_{c_2}\rVert}\right) \in [0, \pi]$$

Denote the minimal such angle over all pairs of classes, and the minimal norm $\lVert v^\kappa_c \rVert$ over all classes. From the linear independence assumption it follows that both quantities are strictly positive, and the convergence of the algorithm depends on them.
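A minimal numeric sketch of this angle quantity, for toy vectors of our own choosing; linearly independent vectors give a strictly positive minimal pairwise angle:

```python
import numpy as np

def angle(u, v):
    """arccos of the cosine similarity, clipped for numerical safety."""
    c = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.arccos(np.clip(c, -1.0, 1.0))

# Three toy linearly-dispersed "mean image" vectors.
vs = [np.array([1.0, 0.0]), np.array([1.0, 1.0]), np.array([0.0, 2.0])]
min_angle = min(angle(u, v) for i, u in enumerate(vs) for v in vs[i + 1:])
```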

#### 5.1.2 High Level Efficient Learnability

The essence of the difficulty in learning the mapping from the observed images to the labels is that we do not observe the high-level semantic images. To make this distinction clear, we assume that, had we been given the semantic images, the learning problem would have been easy. Formally, there exists a classification algorithm, denoted CLS, that upon receiving an i.i.d. training set of polynomial size drawn from the distribution over semantic images and labels, returns (with high probability, and after running in polynomial time) a classifier whose error is at most the target accuracy.

#### 5.1.3 Assumptions on the Implementation of the Two Layers Building Block

For the analysis, we train the two-layer network with respect to a simplified loss. This loss simplifies the analysis, and seems to capture a behavior similar to that of other losses used in practice.

Although in practice we perform a variant of SGD on a sample of the data to train the network, we carry out the analysis with respect to the population loss. We denote by $K^{(t)}$ the weights of the first layer of the network at iteration $t$, and by $W^{(0)}$ the initial weights of the second layer. For simplicity of the analysis, we assume that only the first layer of the network is trained, while the weights of the second layer stay fixed. Thus, at each iteration of gradient descent we perform the update step $K^{(t+1)} = K^{(t)} - \eta \nabla_K L\big(K^{(t)}, W^{(0)}\big)$, where $L$ is the population loss and $\eta$ the learning rate. Applying this update multiple times trains the network.

As for the initialization of the first layer, assume we initialize each of its columns from a uniform distribution on a sphere whose radius is at most $\sigma$, where $\sigma$ is a parameter of the algorithm. The initial weights of the second layer are fixed at initialization as well.
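A minimal sketch of such an initialization. For concreteness we sample exactly on the radius-$\sigma$ sphere (the assumption only requires radius at most $\sigma$), using the standard fact that a normalized Gaussian vector is uniform on the sphere; the dimensions are toy:

```python
import numpy as np

rng = np.random.default_rng(0)

def init_K(ell, k, sigma):
    """Each of the k columns: uniform on the radius-sigma sphere in R^ell."""
    g = rng.standard_normal((ell, k))
    return sigma * g / np.linalg.norm(g, axis=0, keepdims=True)

K0 = init_K(ell=5, k=3, sigma=0.01)
```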

### 5.2 Two-Layer Algorithm

In this part of the analysis we limit ourselves to the properties of the two-layer network trained in a single iteration of the main algorithm, and introduce a few simple notations to make the analysis clearer. We first assume that we are given some mapping $\phi$ from patches to vectors, whose image is a set of orthonormal vectors (this mapping is learned by the previous steps of the algorithm, as we show in Section 5.3). Assume we observe the distribution of the images after applying $\phi$. Recall that each observed image is generated from a latent distribution over higher-level images. We overload the notation and use $m$ to denote the size of the semantic images from the higher level of the model, so the observed distribution is over images of $m$ embedded patches, paired with labels.

Now, we can "forget" the intermediate latent distributions, and assume the observed distribution is given by first sampling a label $y$, then sampling a latent semantic image $z \sim D_y$ (the distribution over semantic images conditioned on $y$), and finally sampling the observed image $X \sim G_z$ (the distribution conditioned on $z$). Thus, we can describe the sampling schematically by:

$$U(\mathcal{Y}) \rightsquigarrow y, \qquad D_y \rightsquigarrow (z, y), \qquad G_z \rightsquigarrow (X, y) \in \mathbb{R}^{\ell \times m} \times \{\pm 1\}$$

We denote by $C$ the set of semantic classes of the images in the latent distribution. For every $c \in C$, denote by $S_c$ the image under $\phi$ of the set of patches in the observed distribution generated from the semantic class $c$. Notice that from the assumption on $\phi$ it follows that $S_c$ is a set of orthonormal vectors. Denote by $v_c$ the "mean" image of the semantic class $c$. The following diagram describes the process of generating an observed example $X$ from a latent image $z$ (the example which is embedded into the observed space):

$$z = \begin{bmatrix} z_1 \\ \vdots \\ z_m \end{bmatrix} \in C^m \;\rightsquigarrow\; \tilde{x} = \begin{bmatrix} \tilde{x}_1 \\ \vdots \\ \tilde{x}_{ms} \end{bmatrix} \;\xrightarrow{\;\mathrm{im2col}\;}\; \begin{bmatrix} \tilde{x}_1 & \dots & \tilde{x}_{ms-s+1} \\ \vdots & & \vdots \\ \tilde{x}_s & \dots & \tilde{x}_{ms} \end{bmatrix} \;\xrightarrow{\;\phi\;}\; \begin{bmatrix} x^{(1)} & \dots & x^{(m)} \end{bmatrix} = X \in \mathbb{R}^{\ell \times m}, \qquad x^{(i)} \in S_{z_i}$$

Now, we can introduce the main theorem of this section. The theorem states that training the two-layer Conv net as defined previously implicitly learns an embedding of the observed patches into a space in which patches from the same semantic class are close to each other, while patches from different classes are far apart. Recall that we do not have access to the latent distribution, and thus cannot possibly learn such an embedding directly. Therefore, this surprising property of gradient descent is the key feature that allows our main algorithm to learn the high-level semantics of the images.

###### Theorem 1

Let the distribution be as described in Section 5.1.1. Assume we train a two-layer network of sufficient width with respect to the population loss on this distribution, with an appropriately chosen learning rate, for sufficiently many iterations, where the training and the initialization parameter are as described in Section 5.1.3. Then, with high probability:

1. for every pair of patches generated from the same semantic class, the learned activations of the two patches are close to each other;

2. for every pair of patches generated from two different semantic classes, the learned activations of the two patches are far apart.

For lack of space, we give the full proof of this theorem in the appendix, and give a rough sketch here. Observe the activation of a given kernel of the first layer on a given patch. Due to the gradient descent update rule, this value changes at each iteration by the corresponding entry of the gradient. Analyzing this gradient shows that the change in the activation is in fact proportional to a quantity that depends only on the patch's semantic class (through its "mean" image). In other words, the semantic class is the only factor that dominates the behavior of the gradient with respect to the patch. Hence, the activations of two patches generated from the same class behave similarly throughout the training process. Furthermore, for patches from two different classes, if the random initialization happens to separate their initial activations (which happens with sufficient probability), then the activations of patches from the two classes move in opposite directions, and after enough iterations will be far apart.

To give more intuition as to why the proof works, we can look at the whole process from a different angle. For two patches in an image sampled from a given distribution, we can consider two measures of similarity. First, a simple "geometric" similarity, like the distance between the two patches. Second, a "semantic" similarity: the similarity between the distributions of occurrences of each patch across the image (i.e., patches that tend to appear in the upper part of the image for positive labels are in this sense "semantically" similar). In our case, we show that the "mean" image of a class gives exactly this measure of similarity: two patches from the same class are semantically similar in the sense that their mean distribution in the image is exactly the same "mean" image. Given this notion, we can see why our full algorithm works: the clustering part of the algorithm merges together geometrically similar patches, while the gradient descent algorithm maps semantically similar patches to geometrically similar vectors, allowing the clustering of the next iteration to again be based on the simple geometric distance. Note that while the technical proof heavily relies on our assumptions, the intuitions above may hold true for real data.

### 5.3 Full Network Training

In this section, we analyze the convergence of the full algorithm described in Algorithm 1, where our main claim is that this algorithm successfully learns a model that classifies the examples sampled from the observed distribution. Formally, our main claim is given in the following theorem:

###### Theorem 2

Suppose that the assumptions given in Section 5.1 hold, and fix a target accuracy and confidence level. Consider the maximal number of semantic classes in a level, and the minimal distance between any two distinct possible patches in the observed images. Choosing the learning rate, network width, number of iterations, and sample size appropriately as functions of these quantities, then with high probability, running Algorithm 1 with these parameters on data from the observed distribution returns a hypothesis whose error is at most the target accuracy.

To show this, we rely on the result of Section 5.2, which guarantees that the embedding learned by the network at each iteration maps patches from the same class to similar vectors. Now, recall that our model assumes that a single pixel in a high-level image is "manifested" as a patch in the lower-level image. Thus, a patch in the higher-level image is manifested as a larger patch in the lower-level image, and many such manifestations are possible. The fact that we find such a "good" embedding allows our simple clustering algorithm to cluster together different low-level manifestations of a single high-level patch. Hence, iteratively applying these embedding and clustering steps allows us to decode the topmost semantic image, which can then be classified by our simple classification algorithm.

Before we show the proof, we recall a few notations used in the algorithm's description. We denote the clustering of patches learned at a given iteration of the algorithm, and the weights of the kernels learned before that step (thus, the patches mapped by those kernels are the input to the clustering algorithm of that iteration). Note that in every step of the algorithm we perform a clustering on spatial patches of the current latent image, while at the last step we cluster only single-pixel patches (i.e., we cluster the vectors in the "channels" dimension). This is because after the final iteration we have a mapping of the topmost latent distribution, where patches of the same class are mapped to similar vectors; to generate a mapping of the semantic classes themselves, we thus only need to cluster these vectors together, to get orthonormal representations of each class. Finally, we use the $\phi * A$ notation to indicate that we operate on every patch of the tensor $A$. When we apply such operations to distributions, we refer to the new distribution generated by applying the operation to every example sampled from the original distribution. The essence of the proof is the following lemma:

###### Lemma 1

Let the observed distribution be over pairs of a real-valued image and a label, and recall that each intermediate level's distribution is over pairs of a latent semantic image and a label. For every level, with high probability, there exists an orthonormal patch mapping under which the representation computed by the network at that step matches the latent semantic image of that level, where the relevant mappings are as defined in Algorithm 1.

The proofs of the lemma and of Theorem 2 are given in Appendix B. The lemma tells us that the neural network at step of the algorithm reveals (in some sense) the latent semantic structure.

## 6 Experiments

As mentioned before, our analysis relies on distributional assumptions formalized in the generative model we suggest. A disadvantage of such analyses is that the assumptions rarely hold for real-world data, as the distribution of natural images is far more complex. The goal of this section is to show that when running our algorithm on CIFAR-10, the performance of our model is in the same ballpark as a vanilla CNN, trained with a common SGD-based optimization algorithm. Hence, even though the data distribution deviates from our assumptions, our algorithm still achieves good performance.

We chose the CIFAR-10 problem as a rich enough dataset of natural images. As our aim is to show that our algorithm achieves results comparable to a vanilla SGD-based optimization, and not to achieve state-of-the-art results on CIFAR-10, we do not use any of the common “tricks” that are widely used when training deep networks (such as data augmentation, dropout, batch normalization, learning rate scheduling, averaging of weights across iterations, etc.). We implemented our algorithm by repeating the following steps twice: (1) Sample patches of size 3x3 uniformly from the dataset. (2) For some number of clusters, run the K-means algorithm to find the cluster centers. (3) Associate each cluster with a vector such that the image of this mapping is a set of orthonormal vectors, and then map every patch in every image to the vector corresponding to the cluster it belongs to. We do so by applying a Conv3x3 layer whose kernels are the cluster centers, followed by a ReLU operation with a fixed bias; this roughly maps each patch to the orthonormal vector of the cluster the patch belongs to. (4) While our analysis corresponds to performing the convolution from the previous step with a specific stride, to make the architecture closer to commonly used CNNs (specifically the one suggested in the Tensorflow implementation [7]), we used a different stride followed by a 2x2 max-pooling. (5) Randomly initialize a two-layer linear network, where the first layer is a Conv1x1 and the second layer is a fully-connected affine layer that outputs 10 channels to predict the 10 classes of CIFAR-10. (6) Train these two layers with the Adam optimizer [9] on the cross-entropy loss, and remove the top layer. The output of the first layer is the output of these steps.
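Steps (1)–(3) above can be sketched as follows (a toy NumPy sketch, not the paper's implementation: the synthetic "patches", the number of clusters, the trivial cluster estimation replacing K-means, and the bias value are all illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Steps (1)-(2): sample patches and estimate k cluster centers. Here the
# "patches" are drawn near k well-separated synthetic prototypes instead
# of CIFAR-10 crops, so per-cluster means stand in for K-means.
k, dim = 4, 27                          # dim = 3*3*3 for a flattened 3x3 RGB patch
prototypes = np.eye(dim)[:k] * 5.0      # hypothetical well-separated centers
labels = rng.integers(0, k, 500)
patches = prototypes[labels] + 0.1 * rng.normal(size=(500, dim))
centers = np.stack([patches[labels == i].mean(0) for i in range(k)])

# Step (3): a "convolution" with the centers as kernels, then ReLU with a
# fixed bias b, so each patch maps to an approximately one-hot vector.
b = 10.0                                # assumed bias; calibrated by search in the paper
def embed(p):
    return np.maximum(p @ centers.T - b, 0.0)

out = embed(patches)
# Each patch activates (essentially) only its own cluster's channel:
assert (out.argmax(1) == labels).all()
assert ((out > 0).sum(1) == 1).all()
```

The fixed bias suppresses the cross-cluster inner products, which is what makes the resulting per-patch code roughly orthonormal across clusters.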

Repeating the above steps twice yields a network with two blocks of Conv3x3-ReLU-Pool-Conv1x1. We feed the output of these steps to a final classifier that is trained, again with Adam on the cross-entropy loss, for 100k iterations, to output the final classification of the model. We test two choices for this classifier: a linear classifier and a three-layer fully-connected neural network. Note that in both cases the output of our algorithm is a vanilla CNN; the only difference is that it was trained differently. To calibrate the various parameters that define the model, we first perform a random parameter search, using 10k examples from the train set as a validation set (and the remaining 40k as a train set). After finding the optimal parameters for all the setups we compare, we train the model again with the calibrated parameters on all the train data, and plot the accuracy on the test data every 10k iterations. The parameters found in the parameter search are listed in Appendix C.

We compared our algorithm to several alternatives. First, the standard CNN configuration from the Tensorflow implementation, in two variants: CNN+(FC+ReLU) is the Tensorflow architecture, and CNN+Linear is the Tensorflow architecture with the last three fully-connected layers replaced by a single fully-connected layer. The goal of this comparison is to show that the performance of our algorithm is in the same ballpark as that of vanilla CNNs. Second, we use the same two architectures, but with random weights for the convolutional layers, training only the FC layers. Some previous analyses of the success of CNNs claimed that their power comes from the random initialization, and that only the training of the last layer matters. As can clearly be seen, random weights fall far short of the performance of vanilla CNNs. Our last experiment aims to show the power of the two-layer training in our algorithm (step 6). To do so, we compare our algorithm to a variant of it in which step 6 is replaced by random projections (based on the Johnson-Lindenstrauss lemma); we denote this variant by Clustering+JL. As can be seen, this variant gives drastically inferior results, showing that the training of the Conv1x1 step is crucial and finds a “good” embedding for the process that follows, as suggested by our theoretical analysis. A summary of all the results is given in Figure 3.

##### Acknowledgements:

This research is supported by the European Research Council (TheoryDL project).

## Appendix A Proof of Theorem 1

For some class c and for some patch x', denote by f_{x'} the function that takes a matrix X and returns a vector whose j'th element is 1 if the j'th column of X, denoted x^{(j)}, equals x', and 0 otherwise. That is,

 f_{x'}(X) := \begin{bmatrix} \mathbb{1}_{x^{(1)}=x'} \\ \vdots \\ \mathbb{1}_{x^{(m)}=x'} \end{bmatrix}

Notice that from the orthonormality of the columns of X it follows that (x')^\top X = f_{x'}(X)^\top.
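The definition of f_{x'} and the orthonormality identity (x')^\top X = f_{x'}(X)^\top can be checked numerically (a small sketch with a hypothetical orthonormal patch alphabet, here the standard basis):

```python
import numpy as np

def f(x_prime, X):
    """Entry j is 1 if the j'th column of X equals x_prime, else 0."""
    return np.array([float(np.array_equal(X[:, j], x_prime))
                     for j in range(X.shape[1])])

d = 4
alphabet = np.eye(d)            # hypothetical orthonormal patch vectors
X = alphabet[:, [2, 0, 2]]      # m = 3 columns, each from the alphabet
x_prime = alphabet[:, 2]
fx = f(x_prime, X)
assert np.array_equal(fx, np.array([1.0, 0.0, 1.0]))
# Orthonormality of the columns gives (x')^T X = f_{x'}(X)^T:
assert np.allclose(x_prime @ X, fx)
```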

We begin with proving the following technical lemma.

###### Lemma 2

For each class c and for each patch x' we have:

 \mathbb{E}_{(X,y)\sim G}\left[-y f_{x'}(X)\right] = \frac{1}{d} v_c

Proof  Observe that

 \mathbb{E}_{(X,y)\sim G}\left[-y f_{x'}(X)\right] = \frac{1}{2}\sum_{y=\pm 1} -y\, \mathbb{E}_{z\sim D_y}\left[\mathbb{E}_{X\sim G_z}\left[f_{x'}(X)\right]\right]

Therefore, for each coordinate j we have:

 \mathbb{E}_{(X,y)\sim G}\left[-y f_{x'}(X)\right]_j = \frac{1}{2}\sum_{y=\pm 1} -y\, \mathbb{E}_{z\sim D_y}\left[\mathbb{E}_{x\sim G_z}\left[\mathbb{1}_{x^{(j)}=x'}\right]\right]
 = \frac{1}{2}\sum_{y=\pm 1} -y\, \mathbb{E}_{z\sim D_y}\left[\mathbb{P}_{x\sim G_z}\left(x^{(j)}=x'\right)\right]
 = \frac{1}{2}\sum_{y=\pm 1} -y\, \mathbb{E}_{z\sim D_y}\left[\frac{1}{d}\mathbb{1}_{z_j=c}\right]
 = \frac{1}{d}\cdot\frac{1}{2}\sum_{y=\pm 1} -y\, \mathbb{E}_{z\sim D_y}\left[F_c(z)_j\right] = \frac{1}{d}[v_c]_j

The next lemma reveals a surprising connection between the gradient and the vectors v_c.

###### Lemma 3

For every i and for every patch x' belonging to class c:

 x' \cdot \frac{\partial}{\partial k^{(i)}} L_{K,W} = \frac{1}{d}\, w^{(i)} \cdot v_c

Proof  For a fixed X and y, denote by \hat{y}(K) the output of the network. Note that:

 \frac{\partial}{\partial k^{(i)}} \hat{y}(K) = X w^{(i)}

So for W = W_0 we have:

 x' \cdot \frac{\partial}{\partial k^{(i)}} \hat{y} = (x')^\top X w^{(i)}_0 = w^{(i)}_0 \cdot f_{x'}(X)

Combining the above with the definition of the loss function and with Lemma 2, we get:

 x' \frac{\partial}{\partial k^{(i)}} L_{K_t,W_0} = \mathbb{E}_{(X,y)\sim G}\left[x' \frac{\partial}{\partial k^{(i)}} \ell_y(\hat{y})\right]
 = \mathbb{E}_{(X,y)\sim G}\left[-y\, x' \frac{\partial}{\partial k^{(i)}} \hat{y}\right]
 = \mathbb{E}_{(X,y)\sim G}\left[-y\, w^{(i)}_0 \cdot f_{x'}(X)\right]
 = w^{(i)}_0 \cdot \mathbb{E}_{(X,y)\sim G}\left[-y f_{x'}(X)\right]
 = \frac{1}{d}\, w^{(i)}_0 \cdot v_c

As an immediate corollary we obtain that a gradient step does not change the gap between the projections of the kernel on two vectors that correspond to the same class (both are associated with the same v_c).

###### Corollary 1

For every t, for every i, for every semantic class c, and for every pair of patches x_1, x_2 belonging to c, it holds that: \left|k^{(i)}_{t+1} \cdot x_1 - k^{(i)}_{t+1} \cdot x_2\right| = \left|k^{(i)}_t \cdot x_1 - k^{(i)}_t \cdot x_2\right|.

Proof  From Lemma 3 we can conclude that for a given class, for every pair x_1, x_2 of patches belonging to it we get:

 x_1 \frac{\partial}{\partial k^{(i)}} L_{K_t,W_0} = x_2 \frac{\partial}{\partial k^{(i)}} L_{K_t,W_0}

From the gradient descent update rule:

 k^{(i)}_{t+1} = k^{(i)}_t - \eta \frac{\partial}{\partial k^{(i)}} L_{K_t,W_0}

And therefore:

 \left|k^{(i)}_{t+1}\cdot x_1 - k^{(i)}_{t+1}\cdot x_2\right| = \left|\left(k^{(i)}_t - \eta\frac{\partial}{\partial k^{(i)}} L_{K_t,W_0}\right)\cdot x_1 - \left(k^{(i)}_t - \eta\frac{\partial}{\partial k^{(i)}} L_{K_t,W_0}\right)\cdot x_2\right|
 = \left|k^{(i)}_t\cdot x_1 - k^{(i)}_t\cdot x_2 - \left(\eta\frac{\partial}{\partial k^{(i)}} L_{K_t,W_0}\cdot x_1 - \eta\frac{\partial}{\partial k^{(i)}} L_{K_t,W_0}\cdot x_2\right)\right|
 = \left|k^{(i)}_t\cdot x_1 - k^{(i)}_t\cdot x_2\right|
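The cancellation behind Corollary 1 is easy to verify numerically: whenever the loss gradient has the same inner product with x_1 and x_2, a gradient step leaves k \cdot x_1 - k \cdot x_2 unchanged (toy vectors below, not the paper's model; the gradient is constructed directly to satisfy the equal-inner-product condition):

```python
import numpy as np

rng = np.random.default_rng(0)
k = rng.normal(size=5)                   # a kernel
x1, x2 = rng.normal(size=5), rng.normal(size=5)

# Any gradient g with g.x1 == g.x2 works; take g orthogonal to x1 - x2.
g = rng.normal(size=5)
diff = x1 - x2
g -= (g @ diff) / (diff @ diff) * diff   # project out the x1 - x2 direction
assert np.isclose(g @ x1, g @ x2)

eta = 0.1
k_next = k - eta * g                     # gradient descent update
# The gap between the projections is preserved exactly:
assert np.isclose(k_next @ x1 - k_next @ x2, k @ x1 - k @ x2)
```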

Next we turn to show that a gradient step improves the separation of vectors coming from different semantic classes.

###### Lemma 4

Fix two classes c_1 and c_2, and recall that we denote the angle between the vectors v_{c_1} and v_{c_2}. Then, with probability depending on this angle over the initialization of w^{(i)}_0, we get:

 \mathrm{sign}\left(w^{(i)}_0 \cdot v_{c_1}\right) \neq \mathrm{sign}\left(w^{(i)}_0 \cdot v_{c_2}\right)

Proof  Consider the projection of w^{(i)}_0 on the plane spanned by v_{c_1} and v_{c_2}. The result is then immediate from the symmetry of the initialization of w^{(i)}_0.

###### Lemma 5

Fix patches x_1 and x_2 belonging to the classes c_1 and c_2, respectively. Then, with probability at least a fixed constant over the initialization, we get:

 \left|k^{(i)}_T\cdot x_1 - k^{(i)}_T\cdot x_2\right| > \frac{1}{d}\eta T\, \frac{\|v_{c_1}\| + \|v_{c_2}\|}{2} - 2\sigma

Proof  Notice that since w^{(i)}_0 is initialized from a Gaussian distribution, the inner product w^{(i)}_0 \cdot v_{c_j} is itself a Gaussian random variable, for j = 1, 2. The probability that a Gaussian random variable deviates from its mean by at most half a standard deviation is \mathrm{erf}(1/(2\sqrt{2})). Thus, we get that:

 \mathbb{P}\left(\left|w^{(i)}_0 \cdot v_{c_j}\right| \le \|v_{c_j}\|\right) = \mathrm{erf}\left(\frac{1}{2\sqrt{2}}\right)

And using the union bound:

 \mathbb{P}\left(\left|w^{(i)}_0 \cdot v_{c_1}\right| \le \|v_{c_1}\| \,\vee\, \left|w^{(i)}_0 \cdot v_{c_2}\right| \le \|v_{c_2}\|\right) \le 2\,\mathrm{erf}\left(\frac{1}{2\sqrt{2}}\right) < 0.77
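The numeric constant in this union bound can be checked directly with the standard library (\mathrm{erf}(1/(2\sqrt{2})) \approx 0.3829, so twice that is below 0.77):

```python
import math

# P(|Z| <= 1/2) for a standard normal Z equals erf(1/(2*sqrt(2))).
p = math.erf(1 / (2 * math.sqrt(2)))
assert abs(p - 0.3829) < 1e-3
assert 2 * p < 0.77    # the union-bound constant used in Lemma 5
```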

Thus, using Lemma 4, we get that the following holds with probability at least a fixed constant:

Assume w.l.o.g. that w^{(i)}_0 \cdot v_{c_1} < 0; then using Lemma 3 we get:

 k^{(i)}_T \cdot x_1 = k^{(i)}_0 \cdot x_1 - \eta\sum_{t=1}^{T} x_1 \frac{\partial}{\partial k^{(i)}} L_{K_t,W_0}
 = k^{(i)}_0 \cdot x_1 - \eta\sum_{t=1}^{T} \frac{1}{d}\, w^{(i)}_0 \cdot v_{c_1}
 = k^{(i)}_0 \cdot x_1 - \frac{1}{d}\eta T\, w^{(i)}_0 \cdot v_{c_1} > \frac{1}{d}\eta T\, \frac{\|v_{c_1}\|}{2} - \sigma

In a similar fashion we can get:

 k^{(i)}_T \cdot x_2 < -\frac{1}{d}\eta T\, \frac{\|v_{c_2}\|}{2} + \sigma

And thus the conclusion follows:

 k^{(i)}_T \cdot x_1 - k^{(i)}_T \cdot x_2 > \frac{1}{d}\eta T\, \frac{\|v_{c_1}\| + \|v_{c_2}\|}{2} - 2\sigma

Finally, we are ready to prove the main theorem.

Proof  of Theorem 1.
We show two things:

1. Fix . By the initialization, we get that for every