# Explaining Landscape Connectivity of Low-cost Solutions for Multilayer Nets

Mode connectivity is a surprising phenomenon in the loss landscape of deep nets. Optima, at least those discovered by gradient-based optimization, turn out to be connected by simple paths on which the loss function is almost constant. Often, these paths can be chosen to be piece-wise linear, with as few as two segments. We give mathematical explanations for this phenomenon, assuming generic properties (such as dropout stability and noise stability) of well-trained deep nets, which have previously been identified as part of understanding the generalization properties of deep nets. Our explanation holds for realistic multilayer nets, and experiments are presented to verify the theory.


## 1 Introduction

Efforts to understand how and why deep learning works have led to a focus on the optimization landscape of training loss. Since optimization to near-zero training loss occurs for many choices of random initialization, it is clear that the landscape contains many global optima (or near-optima). However, the loss can become quite high when interpolating between found optima, suggesting that these optima occur at the bottom of “valleys” surrounded on all sides by high walls. Therefore the phenomenon of mode connectivity (Garipov et al., 2018; Draxler et al., 2018) came as a surprise: optima (at least the ones discovered by gradient-based optimization) are connected by simple paths in the parameter space, on which the loss function is almost constant. In other words, the optima are not walled off in separate valleys as hitherto believed. More surprisingly, the paths connecting discovered optima can be piece-wise linear with as few as two segments.

Mode connectivity begs for theoretical explanation. One paper (Freeman and Bruna, 2016) attempted such an explanation for 2-layer nets, even before the discovery of the phenomenon in multilayer nets. However, they require the width of the net to be exponential in some relevant parameters. Others (Venturi et al., 2018; Liang et al., 2018; Nguyen et al., 2018; Nguyen, 2019) require special structure in the neural network, where the number of neurons needs to exceed the number of training data points. Thus it remains an open problem to explain even the 2-layer case with realistic parameter settings, let alone standard multilayer architectures.

At first sight, finding a mathematical explanation of the mode connectivity phenomenon for multilayer nets (e.g., for a deep ResNet on ImageNet) appears very challenging. However, the glimmer of hope is that since the phenomenon exists for a variety of architectures and datasets, it must arise from some generic property of trained nets. The fact that the connecting paths between optima can have as few as two linear segments further bolsters this hope.

Strictly speaking, empirical findings such as in (Garipov et al., 2018; Draxler et al., 2018) do not show connectivity between all optima, but only for typical optima discovered by gradient-based optimization. It seems an open question whether connectivity holds for all optima in overparametrized nets. Section 5 answers this question, via a simple example of an overparametrized two-layer net, not all of whose optima are connected via low-cost paths.

Thus to explain mode connectivity one must seek generic properties that hold for optima obtained via gradient-based optimization on realistic data. A body of work that could be a potential source of such generic properties is the ongoing effort to understand the generalization puzzle of over-parametrized nets, specifically to understand the “true model capacity”. For example, Morcos et al. (2018) note that networks that generalize are insensitive to linear restrictions in the parameter space. Arora et al. (2018) define a noise stability property of deep nets, whereby adding Gaussian noise to the output of a layer is found to have minimal effect on the vector computed at subsequent layers. Such properties seem to arise in a variety of architectures purely from gradient-based optimization, without any explicit noise-injection during training, though of course using small-batch gradient estimates is an implicit source of noise-injection. (Sometimes training also explicitly injects noise, e.g. dropout or batch-normalization, but that is not needed for the noise stability to emerge.)

Since resilience to perturbations arises in a variety of architectures, such resilience counts as a “generic” property, and it is natural to prove mode-connectivity as a consequence of this property. We carry this out in the current paper. Note that our goal here is not to explain every known detail of mode connectivity, but rather to give a plausible first-cut explanation.

First, in Section 3 we explain mode connectivity by assuming the network is trained via dropout. In fact, the desired property is weaker: so long as there exists even a single dropout pattern that keeps the training loss close to optimal on the two solutions, our proof constructs a piece-wise linear path between them. The number of linear segments grows linearly with the depth of the net.

Then, in Section 4 we make a stronger assumption of noise stability along the lines of Arora et al. (2018) and show that it implies mode-connectivity using paths with linear segments. While this assumption is strong, it appears to be close to what is satisfied in practice. (Of course, one could explicitly train deep nets to satisfy the needed noise stability assumption, and the theory applies directly to them.)

### 1.1 Related work

The landscape of the loss function for training neural networks has received a lot of attention. Dauphin et al. (2014); Choromanska et al. (2015) conjectured that local minima of multi-layer neural networks have similar loss function values, and proved the result in idealized settings. For linear networks, it is known (Kawaguchi, 2016) that all local minima are also globally optimal.

Several theoretical works explored whether a neural network has spurious valleys (non-global minima that are surrounded by other points with higher loss). Freeman and Bruna (2016) showed that for a two-layer net, if it is sufficiently overparametrized then all the local minimizers are (approximately) connected. However, in order to guarantee a small loss along the path they need the number of neurons to be exponential in the number of input dimensions. Venturi et al. (2018) proved that if the number of neurons is larger than the number of training samples or the intrinsic dimension (infinite for standard architectures), then the neural network cannot have spurious valleys. Liang et al. (2018) proved similar results for the binary classification setting. Nguyen et al. (2018) and Nguyen (2019) relaxed the requirement on overparametrization, but still require the output layer to have more direct connections than the number of training samples.

Some other papers study the existence of spurious local minima. Yun et al. (2018) showed that in most cases neural networks have spurious local minima. Note that a local minimum only needs to have loss no larger than the points in its neighborhood, so a local minimum is not necessarily a spurious valley. Safran and Shamir (2018) found spurious local minima for simple two-layer neural networks under a Gaussian input distribution. These spurious local minima are indeed spurious valleys as they have a positive definite Hessian.

## 2 Preliminaries

##### Notations

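For a vector $v$, we use $\|v\|$ to denote its $\ell_2$ norm. For a matrix $A$, we use $\|A\|$ to denote its operator norm, and $\|A\|_F$ to denote its Frobenius norm. We use $[n]$ to denote the set $\{1, 2, \dots, n\}$. We use $I_n$ to denote the identity matrix in $\mathbb{R}^{n \times n}$. We use $O(\cdot)$ and $\Omega(\cdot)$ to hide constants and use $\tilde{O}(\cdot)$ and $\tilde{\Omega}(\cdot)$ to hide poly-logarithmic factors.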

##### Neural network

In most of the paper, we consider fully connected neural networks with ReLU activations. Note however that our results can also be extended to convolutional neural networks (in particular, see Remark 1 and the experiments in Section 6).

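Suppose the network has $d$ layers. Let the vector before activation at layer $i$ be $x^i$, $i \in [d]$, where $x^d$ is just the output. For convenience, we also denote the input $x$ as $x^0$. Let $A_i$ be the weight matrix at the $i$-th layer, so that we have $x^i = A_i \phi(x^{i-1})$ for $2 \le i \le d$ and $x^1 = A_1 x^0$, where $\phi$ is the ReLU activation. For any layer $i$, $1 \le i \le d-1$, let the width of the layer be $h_i$. We use $[A_i]_j$ to denote the $j$-th column of $A_i$. Let the maximum width of the hidden layers be $h_{\max} := \max\{h_1, h_2, \dots, h_{d-1}\}$ and the minimum width of the hidden layers be $h_{\min} := \min\{h_1, h_2, \dots, h_{d-1}\}$.

We use $\Theta$ to denote the set of parameters of the neural network, and in our specific model, $\Theta = \mathbb{R}^{h_1 \times h_0} \times \mathbb{R}^{h_2 \times h_1} \times \cdots \times \mathbb{R}^{h_d \times h_{d-1}}$, which consists of all the weight matrices $\{A_i\}$’s.

Throughout the paper, we use $f_\theta$, $\theta \in \Theta$, to denote the function that is computed by the neural network. For a data set $(x, y) \sim D$, the loss is defined as $L_D(f_\theta) := \mathbb{E}_{(x,y) \sim D}[l(y, f_\theta(x))]$, where $l$ is a loss function. The loss function $l(y, \hat{y})$ is convex in the second parameter. We omit the distribution $D$ when it is clear from the context.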

##### Mode connectivity and spurious valleys

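Fixing a neural network architecture, a data set $D$ and a loss function, we say two sets of parameters/solutions $\theta^A$ and $\theta^B$ are $\epsilon$-connected if there is a path $\pi(t): \mathbb{R} \to \Theta$ that is continuous with respect to $t$ and satisfies: 1. $\pi(0) = \theta^A$; 2. $\pi(1) = \theta^B$ and 3. for any $t \in [0, 1]$, $L(f_{\pi(t)}) \le \max\{L(f_{\theta^A}), L(f_{\theta^B})\} + \epsilon$. If $\epsilon = 0$, we omit $\epsilon$ and just say they are connected.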
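To make the three conditions concrete, the following minimal sketch checks them numerically for a piece-wise linear path. It assumes the bound in condition 3 is the larger of the two endpoint losses plus the slack `eps`; the helper names `path_point` and `is_eps_connected`, the finite grid of path positions, and the toy loss with a ring of minimizers are all illustrative choices, not part of the paper.

```python
import numpy as np

def path_point(knots, t):
    """Point at position t in [0, 1] on the piece-wise linear path through `knots`."""
    n_seg = len(knots) - 1
    s = min(int(t * n_seg), n_seg - 1)   # index of the active segment
    u = t * n_seg - s                    # position inside that segment, in [0, 1]
    return (1.0 - u) * knots[s] + u * knots[s + 1]

def is_eps_connected(loss, knots, eps, n_grid=201):
    """Check loss(path(t)) <= max(endpoint losses) + eps on a finite grid of t."""
    bound = max(loss(knots[0]), loss(knots[-1])) + eps
    ts = np.linspace(0.0, 1.0, n_grid)
    return all(loss(path_point(knots, t)) <= bound + 1e-12 for t in ts)

def toy_loss(theta):
    """Toy loss (an assumption): every point on the unit circle is a minimizer."""
    return float((theta @ theta - 1.0) ** 2)

theta_a = np.array([1.0, 0.0])
theta_b = np.array([-1.0, 0.0])
direct = [theta_a, theta_b]                         # one segment through the origin
detour = [theta_a, np.array([0.0, 1.0]), theta_b]   # two segments around the barrier
```

In this toy example the straight line between the two minimizers crosses a barrier at the origin, while the two-segment detour stays at low loss, which mirrors the observation that two linear segments often suffice.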

If all the local minimizers are connected, then we say that the loss function has the mode connectivity property. However, as we later show in Section 5, this property is very strong and is not true even for overparametrized two-layer nets. Therefore we restrict our attention to classes of low-cost solutions that can be found by the gradient-based algorithms (in particular in Section 3 we focus on solutions that are dropout stable, and in Section 4 we focus on solutions that are noise stable). We say the loss function has -mode connectivity property with respect to a class of low-cost solutions , if any two minimizers in are -connected.

Mode connectivity is closely related to the notion of spurious valleys and connected sublevel sets (Venturi et al., 2018). If a loss function has all its sublevel sets () connected, then it has the mode connectivity property. When the network only has the mode connectivity property with respect to a class of solutions , as long as the class contains a global minimizer, we know there are no spurious valleys in .

However, we emphasize that neither mode connectivity nor lack of spurious valleys implies that any local search algorithm can efficiently find the global minimizer. These notions only suggest that it is unlikely for local search algorithms to get completely stuck.

## 3 Connectivity of dropout-stable optima

In this section we show that dropout-stable solutions are connected. More concretely, we define a solution to be -dropout stable if we can remove a subset of half its neurons in each layer such that the loss remains steady.

###### Definition 1.

(Dropout Stability) A solution is -dropout stable if for all such that , there exists a subset of at most hidden units in each of the layers from through such that after rescaling the outputs of these hidden units (or equivalently, the corresponding rows and/or columns of the relevant weight matrices) by some factor (our results also hold if this factor is allowed to vary from layer to layer) and setting the outputs of the remaining units to zero, we obtain a parameter such that .
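As an illustration of the construction in Definition 1, the sketch below builds a dropout version of a small bias-free ReLU network: it keeps a chosen subset of hidden units in each layer, rescales their outputs, and zeroes the rest. The function names, the list-of-matrices representation, the kept index sets and the factor `scale=2.0` are assumptions made for the example only.

```python
import numpy as np

def forward(weights, x):
    """Bias-free fully connected ReLU network; the last matrix is the output layer."""
    h = x
    for W in weights[:-1]:
        h = np.maximum(W @ h, 0.0)
    return weights[-1] @ h

def dropout_version(weights, keep, scale):
    """Zero every hidden unit not listed in keep[l] and rescale the kept units' outputs."""
    new = [W.copy() for W in weights]
    for l, kept in enumerate(keep):            # l indexes the hidden layers
        mask = np.zeros(weights[l].shape[0])
        mask[kept] = 1.0
        new[l] *= mask[:, None]                # zero the input rows of dropped units
        new[l + 1] *= (scale * mask)[None, :]  # zero or rescale the output columns
    return new

rng = np.random.default_rng(0)
weights = [rng.normal(size=(4, 3)), rng.normal(size=(4, 4)), rng.normal(size=(1, 4))]
keep = [[0, 1], [2, 3]]                        # half of the units in each hidden layer
dropped = dropout_version(weights, keep, scale=2.0)

# The dropped network computes the same function as a genuinely narrower network.
narrow = [
    weights[0][keep[0], :],
    2.0 * weights[1][np.ix_(keep[1], keep[0])],
    2.0 * weights[2][:, keep[1]],
]
x = rng.normal(size=3)
```

Comparing `forward(dropped, x)` with `forward(narrow, x)` shows that a dropout version is just a half-width sub-network embedded in the original parameter space, which is the sense in which a dropout-stable solution uses only part of the network's capacity.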

Intuitively, if a solution is -dropout stable then it is essentially only using half of the network’s capacity. We show that such solutions are connected:

###### Theorem 1.

Let and be two -dropout stable solutions. Then there exists a path in parameter space between and such that for . In other words, letting be the set of solutions that are -dropout stable, a ReLU network has the -mode connectivity property with respect to .

Our path construction in Theorem 1 consists of two key steps. First we show that we can rescale at least half the hidden units in both and to zero via continuous paths of low loss, thus obtaining two parameters and satisfying the criteria in Definition 1.

###### Lemma 1.

Let be an -dropout stable solution and let be specified as in Definition 1 for . Then there exists a path in parameter space between and such that for .

Though naïvely one might expect to be able to directly connect the weights of and via interpolation, such a path may incur high loss as the loss function is not convex over . In our proof of Lemma 1, we rely on a much more careful construction. The construction uses two types of steps: (a) interpolate between two weights in the top layer (the loss is convex in the top layer weights); (b) if a set of neurons already have 0 output weights, then we can change their input weights arbitrarily. See Figure 1 for a path for a 3-layer network. Here we have separated the weight matrices into blocks: , and . The path consists of 6 steps alternating between type (a) and type (b). Note that for all the type (a) steps, we only update the top layer weights; for all the type (b) steps, we only change rows of a weight matrix (inputs to neurons) if the corresponding columns in the previous matrix (outputs of neurons) are already 0. In Section A we show how such a path can be generalized to any number of layers.
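The two step types can be checked numerically on a small example. The sketch below uses a two-layer ReLU network with squared loss on random data; the data, the layer sizes and the choice of squared loss (one instance of a loss that is convex in the network output) are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 64))        # 64 toy inputs of dimension 5
Y = rng.normal(size=(1, 64))        # toy regression targets

def forward(W1, W2):
    return W2 @ np.maximum(W1 @ X, 0.0)

def loss(W1, W2):
    return float(np.mean((forward(W1, W2) - Y) ** 2))

W1 = rng.normal(size=(6, 5))
W2 = rng.normal(size=(1, 6))
W2[:, 3:] = 0.0                     # hidden units 3..5 have zero output weights

# Type (b): units with zero output weights can have their input weights changed freely.
W1_moved = W1.copy()
W1_moved[3:, :] = rng.normal(size=(3, 5))
type_b_unchanged = np.allclose(forward(W1, W2), forward(W1_moved, W2))

# Type (a): the loss is convex in the top-layer weights, so along a straight line
# between two top-layer settings it never exceeds the larger endpoint loss.
W2_target = rng.normal(size=(1, 6))
ts = np.linspace(0.0, 1.0, 21)
losses = [loss(W1, (1 - t) * W2 + t * W2_target) for t in ts]
type_a_bounded = max(losses) <= max(losses[0], losses[-1]) + 1e-9
```

Both flags evaluate to true here: the first because the changed rows feed only into zero columns, the second because the loss restricted to the line is a convex function of the interpolation parameter.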

We then show that we can permute the hidden units of such that its non-zero units do not intersect with those of , thus allowing us to interpolate between these two parameters. This is formalized in the following lemma and the proof is deferred to supplementary material.

###### Lemma 2.

Let and and be two solutions such that at least of the units in each hidden layer have been set to zero in both. Then there exists a path in parameter space between and with 8 line segments such that .
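The idea behind Lemma 2 is easiest to see for a two-layer network, which is the only case this sketch covers (the multilayer path in the lemma is more involved). Permuting hidden units does not change the function, and once the non-zero units of the two solutions occupy disjoint positions, interpolating the top layer mixes the two outputs linearly. All names and sizes below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def forward(W1, W2, x):
    return W2 @ np.maximum(W1 @ x, 0.0)

def half_zero_net():
    """Two-layer net of width 6 whose hidden units 3..5 are set to zero."""
    W1 = rng.normal(size=(6, 4))
    W2 = rng.normal(size=(2, 6))
    W1[3:, :] = 0.0
    W2[:, 3:] = 0.0
    return W1, W2

W1a, W2a = half_zero_net()
W1b, W2b = half_zero_net()

# Move the non-zero units of the second net to positions 3..5
# (new unit k is old unit perm[k]); the computed function is unchanged.
perm = np.array([3, 4, 5, 0, 1, 2])
W1b_p, W2b_p = W1b[perm, :], W2b[:, perm]

x = rng.normal(size=4)
same_function = np.allclose(forward(W1b, W2b, x), forward(W1b_p, W2b_p, x))

# With disjoint supports, both sets of hidden units fit in one first layer, and
# interpolating the top layer interpolates the two network outputs.
W1_union = W1a + W1b_p
t = 0.3
mixed = forward(W1_union, (1 - t) * W2a + t * W2b_p, x)
expected = (1 - t) * forward(W1a, W2a, x) + t * forward(W1b, W2b, x)
```

Since `mixed` equals the convex combination `expected`, a loss that is convex in the network output stays below the larger of the two endpoint losses along this interpolation.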

Theorem 1 follows immediately from Lemma 1 and Lemma 2, as one can first connect to its dropout version using Lemma 1, then connect to the dropout version of using Lemma 2, and finally connect to using Lemma 1 again.

Finally, our results can be generalized to convolutional networks if we do channel-wise dropout (Tompson et al., 2015; Keshari et al., 2018).

###### Remark 1.

For convolutional networks, channel-wise dropout randomly sets entire channels to 0 and rescales the remaining channels by an appropriate factor. Theorem 1 can be extended to work with channel-wise dropout on convolutional networks.
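A minimal sketch of channel-wise dropout on a batch of feature maps is given below. The `(batch, channels, height, width)` layout, the independent draw per sample and channel, and the `1 / (1 - p)` rescaling (the usual inverted-dropout convention) are assumptions; the remark itself only asks for "an appropriate factor".

```python
import numpy as np

def channelwise_dropout(x, p, rng):
    """Drop whole channels of x (batch, channels, height, width) with probability p."""
    keep = rng.random(x.shape[:2]) >= p              # one draw per (sample, channel)
    return x * keep[:, :, None, None] / (1.0 - p)    # rescale the surviving channels

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 8, 5, 5))
out = channelwise_dropout(x, p=0.5, rng=rng)
```

Every channel of `out` is either identically zero or a rescaled copy of the corresponding input channel, so a filter plays the role that a single hidden unit plays in the fully connected case.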

## 4 Connectivity via noise stability

In this section, we relate mode connectivity to another notion of robustness for neural networks—noise stability. It has been observed (Morcos et al., 2018) that neural networks often perform as well even if a small amount of noise is injected into the hidden layers. This was formalized in (Arora et al., 2018), where the authors showed that noise-stable networks tend to generalize well. In this section we use a very similar notion of noise stability, and show that all noise-stable solutions can be connected as long as the network is sufficiently overparametrized.

We begin in Section 4.1 by restating the definitions of noise stability in (Arora et al., 2018) and also highlighting the key differences. In Section 6 we verify these assumptions in practice. In Section 4.2, we first prove that noise stability implies dropout stability (so Theorem 1 applies) and then show that it is in fact possible to connect noise stable neural networks via even simpler paths.

### 4.1 Noise stability

First we will introduce some additional notations and assumptions. In this section, we consider a finite and fixed training set . For a network parameter , the empirical loss function is . Here the loss function is assumed to be -Lipschitz in : for any and any we have The standard cross entropy loss over the softmax function is -Lipschitz.

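For any two layers $i \le j$, let $M^{i,j}$ be the operator for composition of these layers, such that $x^j = M^{i,j}(x^i)$. Let $J^{i,j}_{x^i}$ be the Jacobian of operator $M^{i,j}$ at input $x^i$. Since the activation functions are ReLU’s, we know $M^{i,j}(x^i) = J^{i,j}_{x^i} x^i$.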

Arora et al. (2018) used several quantities to define the noise stability. We list the definitions of these quantities below.

###### Definition 2 (Noise Stability Quantities).

Given sample set , the layer cushion of layer is defined as

For any two layers , the interlayer cushion is defined as

Furthermore, for any layer the minimal interlayer cushion is defined as

(Note that and .)

The activation contraction is defined as

Intuitively, these quantities measure the stability of the network for both a single layer and multiple layers. Note that this definition of the interlayer cushion is slightly different from the original definition in (Arora et al., 2018). Specifically, in the denominator of the definition, we replace the Frobenius norm of by its spectral norm. In the original definition, the interlayer cushion is at most simply because and . With this new definition, the interlayer cushion need not depend on the layer width .

The final quantity of interest is interlayer smoothness, which measures how close the network is to its linear approximation for a specific type of noise. In this work we focus on the noise generated by the dropout procedure (Algorithm 1). Let be weights of the original network, and let be the result of applying Algorithm 1 to weight matrices from layer to layer . (Note that the first weight matrix is excluded because dropping out columns in the second weight matrix already drops out the neurons in layer 1; dropping out columns in the first weight matrix would drop out input coordinates, which is not necessary.) For any input , let and be the vector before activation at layer using parameters and respectively.

###### Definition 3 (Interlayer Smoothness).

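Given the scenario above, define interlayer smoothness $\rho$ to be the smallest number such that with probability at least over the randomness in Algorithm 1 for any two layers satisfying for every , , and

$$\left\| M^{i,j}\big(\hat{x}^{i}_{i}(t)\big) - J^{i,j}_{x^i}\big(\hat{x}^{i}_{i}(t)\big) \right\| \le \frac{\|\hat{x}^{i}_{i}(t) - x^i\| \, \|x^j\|}{\rho \, \|x^i\|},$$

$$\left\| M^{i,j}\big(\hat{x}^{i}_{i-1}(t)\big) - J^{i,j}_{x^i}\big(\hat{x}^{i}_{i-1}(t)\big) \right\| \le \frac{\|\hat{x}^{i}_{i-1}(t) - x^i\| \, \|x^j\|}{\rho \, \|x^i\|}.$$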

If the network is smooth (has Lipschitz gradient), then interlayer smoothness holds as long as is small. The assumption here asks the network to behave smoothly in the random directions generated by randomly dropping out columns of the matrices.

Similar to (Arora et al., 2018), we have defined multiple quantities measuring the noise stability of a network. These quantities are indeed small constants, as we verify experimentally in Section 6. Finally we combine all these quantities to define a single number that measures the noise stability of a network.

###### Definition 4 (Noise Stability).

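For a network with layer cushion $\mu_i$, minimal interlayer cushion $\mu_{i\to}$, activation contraction $c$ and interlayer smoothness $\rho$, if the minimum width layer $h_{\min}$ is at least wide, and for , we say the network is $\epsilon$-noise stable for

$$\epsilon = \frac{\beta \, c \, d^{3/2} \max_{x \in S}\big(\|f_\theta(x)\|\big)}{h_{\min}^{1/2} \, \min_{2 \le i \le d}\big(\mu_i \, \mu_{i\to}\big)}.$$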

The network is more robust if is small. Note that the quantity is small as long as the hidden layer width is large compared to the noise stable parameters. Intuitively, we can think of this value as a single parameter that measures the noise stability of the network.

### 4.2 Noise stability implies dropout stability

We now show that noise stable local minimizers must also be dropout stable, from which it follows that noise stable local minimizers are connected. We first define the dropout procedure we will be using in Algorithm 1.

The main theorem that we prove in this section is:

###### Theorem 2.

Let and be two fully connected networks that are both -noise stable. Then there exists a path with line segments in parameter space between and such that for . (Here hides log factors on relevant factors including and for layers .)

To prove the theorem, we will first show that the networks and are -dropout stable. This is captured in the following main lemma:

###### Lemma 3.

Let be an -noise stable network, and let be the network with weight matrices from layer to layer dropped out by Algorithm 1 with dropout probability . For any , assume for For any , define the network on the segment from to as . Then, with probability at least over the weights generated by Algorithm 1, , for any .

The main difference between Lemma 3 and Lemma 1 is that we can now directly interpolate between the original network and its dropout version, which reduces the number of segments required. This is mainly because in the noise stable setting, we can prove that after dropping out the neurons, not only does the output remain stable, but every intermediate layer also remains stable.
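The direct interpolation can be sketched as follows. Algorithm 1 is not reproduced in this text, so the column dropout below (independent Bernoulli draws per column with `1 / (1 - p)` rescaling) is an assumed stand-in rather than the procedure itself, and the parameterization `t * W + (1 - t) * W_dropped` of the segment is likewise an assumption. The random weights are placeholders and are not claimed to be noise stable.

```python
import numpy as np

def drop_columns(W, p, rng):
    """Zero each column of W with probability p and rescale the remaining columns."""
    keep = rng.random(W.shape[1]) >= p
    return W * keep[None, :] / (1.0 - p)

def pre_activations(weights, x):
    """Vector before activation at every layer of a bias-free ReLU network."""
    acts, h = [], x
    for W in weights:
        h = W @ h
        acts.append(h)
        h = np.maximum(h, 0.0)
    return acts

rng = np.random.default_rng(0)
widths = [10, 256, 256, 256, 1]
weights = [rng.normal(size=(widths[i + 1], widths[i])) / np.sqrt(widths[i])
           for i in range(len(widths) - 1)]

# The first matrix is left intact; columns of every later matrix are dropped.
dropped = [weights[0]] + [drop_columns(W, 0.5, rng) for W in weights[1:]]

def segment(t):
    """Point on the straight segment; t = 1 is the original network."""
    return [t * W + (1 - t) * Wd for W, Wd in zip(weights, dropped)]

x = rng.normal(size=10)
reference = pre_activations(weights, x)
rel_change = {}
for t in (0.0, 0.5, 1.0):
    acts = pre_activations(segment(t), x)
    # Relative change of each hidden layer's pre-activation vector.
    rel_change[t] = [float(np.linalg.norm(a - r) / np.linalg.norm(r))
                     for a, r in zip(acts[:-1], reference[:-1])]
```

The dictionary `rel_change` records how far each hidden layer moves along the segment; for a trained network one could inspect these values to see whether the intermediate layers stay stable in the sense used by Lemma 3.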

From Lemma 3, the proof of Theorem 2 is very similar to the proof of Theorem 1. The detailed proof is given in Section B.

The additional power of Lemma 3 also allows us to consider a smaller dropout probability. The theorem below allows us to trade off the dropout fraction against the energy barrier that we can prove: if the network is highly overparametrized, one can choose a small dropout probability, which allows the energy barrier to be smaller.

###### Theorem 3.

Suppose there exists a network with layer width for each layer that achieves loss , and minimum hidden layer width . Let and be two -noise stable networks. For any dropout probability , if for any , then there exists a path with line segments in parameter space between and such that for .

Intuitively, we prove this theorem by connecting and via the neural network with narrow hidden layers. The detailed proof is given in Section B.

## 5 Disconnected modes in two-layer nets

The mode connectivity property is not true for every neural network. Freeman and Bruna (2016) gave a counter-example showing that if the network is not overparametrized, then there can be different global minima of the neural network that are not connected. Venturi et al. (2018) showed that spurious valleys can exist for 2-layer ReLU nets with an arbitrary number of hidden units, but again they do not extend their result to the overparametrized setting. In this section, we show that even if a neural network is overparametrized—in the sense that there exists a network of smaller width that can achieve optimal loss—there can still be two global minimizers that are not connected.

In particular, suppose we are training a two-layer ReLU student network with hidden units to fit a dataset generated by a ground truth two-layer ReLU teacher network with hidden units such that the samples in the dataset are drawn from some input distribution and the labels computed via forward passes through the teacher network. The following theorem demonstrates that regardless of the degree to which the student network is overparametrized, we can always construct such a dataset for which global minima are not connected.

###### Theorem 4.

For any width and and convex loss function such that is minimized when , there exists a dataset generated by ground-truth teacher network with two hidden units (i.e. ) and one output unit such that global minimizers are not connected for a student network with hidden units.

Our proof is based on an explicit construction. The detailed construction is given in Section C.

## 6 Experiments

We now demonstrate that our assumptions and theoretical findings accurately characterize mode connectivity in practical settings. In particular, we empirically validate our claims using standard convolutional architectures—for which we treat individual filters as the hidden units and apply channel-wise dropout (see Remark 1)—trained on datasets such as CIFAR-10 and MNIST.

Training with dropout is not necessary for a network to be either dropout-stable or noise-stable. Recall that our definition of dropout-stability merely requires the existence of a particular sub-network with half the width of the original that achieves low loss. Moreover, as Theorem 3 suggests, if there exists a narrow network that achieves low loss, then we need only be able to drop out a number of filters equal to the width of the narrow network to connect local minima.

First, we demonstrate in the left plot in Figure 2, on MNIST that 3-layer convolutional nets (not counting the output layer) with 32 filters in each layer tend to be fairly dropout stable—both in the original sense of Definition 1 and especially if we relax the definition to allow for wider subnetworks—despite the fact that no dropout was applied in training. For each trial, we randomly sampled dropout networks with exactly non-zero filters in each layer and report the performance of the best one. In the center plot, we verify for we can construct a linear path from our convolutional net to a dropout version of itself. Similar results were observed when varying . Finally, in the right plot we demonstrate the existence of 3-layer convolutional nets just a few filters wide that are able to achieve low loss on MNIST. Taken together, these results indicate that our path construction in Theorem 3 performs well in practical settings.
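The "sample several dropout networks and keep the best" procedure can be sketched as a small random search. For brevity this sketch uses a two-layer fully connected network on random data instead of the convolutional nets and MNIST of the experiment; the number of samples, the rescaling factor `width / k` and every name below are hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 200))       # toy inputs (placeholder for real data)
Y = rng.normal(size=(1, 200))       # toy targets
W1 = rng.normal(size=(32, 8))       # 32 hidden units
W2 = rng.normal(size=(1, 32)) / 32

def loss_with_units(kept):
    """Squared loss of the sub-network that keeps only the hidden units in `kept`."""
    scale = W1.shape[0] / len(kept)  # assumed rescaling of the surviving units
    pred = scale * W2[:, kept] @ np.maximum(W1[kept, :] @ X, 0.0)
    return float(np.mean((pred - Y) ** 2))

def best_of_n(k, n_samples):
    """Sample n_samples random sub-networks with exactly k units; return the best."""
    best_loss, best_kept = np.inf, None
    for _ in range(n_samples):
        kept = rng.choice(W1.shape[0], size=k, replace=False)
        cur = loss_with_units(kept)
        if cur < best_loss:
            best_loss, best_kept = cur, kept
    return best_loss, best_kept

loss_16, kept_16 = best_of_n(k=16, n_samples=20)
```

Reporting the best sampled sub-network matches the existential form of Definition 1, which only requires that some subset of units achieves low loss, not that a typical random subset does.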

We also demonstrate that the VGG-11 (Simonyan and Zisserman, 2014) architecture trained with channel-wise dropout (Tompson et al., 2015; Keshari et al., 2018) with at the first three layers (we find the first three layers are less resistant to channel-wise dropout) and at the others on CIFAR-10 converges to a noise stable minimum, as measured by layer cushion, interlayer cushion, activation contraction and interlayer smoothness. The network under investigation achieves 95% training and 91% test accuracy with channel-wise dropout activated, in comparison to 99% training and 92% test accuracy with dropout turned off. Figure 3 plots the distribution of the noise stability parameters over different data points in the training set. These plots suggest that they behave nicely. Interestingly, we also discovered that networks trained without channel-wise dropout exhibit similarly nice behavior on all but the first few layers. Finally, we demonstrate that along the path described by Theorem 3 between two noise stable VGG-11 networks and , the training loss remains fairly low and the accuracy remains fairly high.

Further details on all experiments are provided in Section D.1.

## References

• S. Arora, R. Ge, B. Neyshabur, and Y. Zhang (2018) Stronger generalization bounds for deep nets via a compression approach. arXiv preprint arXiv:1802.05296.
• A. Choromanska, M. Henaff, M. Mathieu, G. B. Arous, and Y. LeCun (2015) The loss surfaces of multilayer networks. In Artificial Intelligence and Statistics, pp. 192–204.
• Y. N. Dauphin, R. Pascanu, C. Gulcehre, K. Cho, S. Ganguli, and Y. Bengio (2014) Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Advances in Neural Information Processing Systems, pp. 2933–2941.
• F. Draxler, K. Veschgini, M. Salmhofer, and F. A. Hamprecht (2018) Essentially no barriers in neural network energy landscape. arXiv preprint arXiv:1803.00885.
• C. D. Freeman and J. Bruna (2016) Topology and geometry of half-rectified network optimization. arXiv preprint arXiv:1611.01540.
• T. Garipov, P. Izmailov, D. Podoprikhin, D. P. Vetrov, and A. G. Wilson (2018) Loss surfaces, mode connectivity, and fast ensembling of DNNs. In Advances in Neural Information Processing Systems, pp. 8789–8798.
• K. Kawaguchi (2016) Deep learning without poor local minima. In Advances in Neural Information Processing Systems, pp. 586–594.
• R. Keshari, R. Singh, and M. Vatsa (2018) Guided dropout. arXiv preprint arXiv:1812.03965.
• S. Liang, R. Sun, Y. Li, and R. Srikant (2018) Understanding the loss surface of neural networks for binary classification. In International Conference on Machine Learning, pp. 2840–2849.
• A. S. Morcos, D. G. Barrett, N. C. Rabinowitz, and M. Botvinick (2018) On the importance of single directions for generalization. arXiv preprint arXiv:1803.06959.
• Q. Nguyen, M. C. Mukkamala, and M. Hein (2018) On the loss landscape of a class of deep neural networks with no bad local valleys. arXiv preprint arXiv:1809.10749.
• Q. Nguyen (2019) On connected sublevel sets in deep learning. arXiv preprint arXiv:1901.07417.
• I. Safran and O. Shamir (2018) Spurious local minima are common in two-layer ReLU neural networks. In International Conference on Machine Learning, pp. 4430–4438.
• K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
• J. Tompson, R. Goroshin, A. Jain, Y. LeCun, and C. Bregler (2015) Efficient object localization using convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 648–656.
• J. A. Tropp (2012) User-friendly tail bounds for sums of random matrices. Foundations of Computational Mathematics 12 (4), pp. 389–434.
• L. Venturi, A. S. Bandeira, and J. Bruna (2018) Spurious valleys in two-layer neural network optimization landscapes. arXiv preprint arXiv:1802.06384.
• C. Yun, S. Sra, and A. Jadbabaie (2018) A critical view of global optimality in deep learning. arXiv preprint arXiv:1802.03487.

## Appendix A Proofs for connectivity of dropout-stable optima

Proof of Lemma 1. Without loss of generality, suppose for each that the subset of non-zero hidden units in each layer are all indexed between and . For , we can partition into quadrants such that . (Here, . If is odd, when we write in the other quadrants we implicitly pad it with zeros in a consistent manner.) Similarly, we can partition such that and such that . We will sometimes use the notation to refer to the value of at a given point on our path, while will always refer to the value of at . We now proceed to prove via induction the existence of a path from to for all whose loss is bounded by , from which the main result immediately follows.

Base case: from to As a base case of the induction, we need to construct a path from to , such that the loss is bounded by . First, note that setting a particular subset of columns (e.g. the right half of columns) in to zero is equivalent to setting the corresponding rows (e.g. the bottom half of rows) of to zero. So from the fact that it follows that we can equivalently replace with without increasing our loss by more than .

In fact, because our loss function is convex over we can actually interpolate between and while keeping our loss below at every point along this subpath.

Then, because we can modify both and any way we’d like without affecting the output of our network. In particular, we can interpolate between and while keeping our loss constant along this subpath, thus arriving at .

From to

Suppose we have found a path from to such that (1) , (2) for , (3) , and (4) for , such that the loss along the path is at most . Note that satisfies all these assumptions, including in particular (2) as there are of course no between and . Now let us extend this path to .

First, because the rightmost columns of are zero for , we can modify the bottom rows of for without affecting the output of our network. In particular, we can set to , as well as to for . From the fact that the loss is convex over and that , it then follows that we can set to via interpolation while keeping our loss below . In particular, note that because the off-diagonal blocks of are zero for , interpolating between the leftmost columns of being non-zero and the rightmost columns of being non-zero simply amounts to interpolating between the outputs of the two subnetworks comprised respectively of the first and last rows of for .

Once we have the leftmost columns of set to zero and in block-diagonal form for , we can proceed to modify the top rows of however we’d like without affecting the output of our network. Specifically, let us set to . We can then reset to via interpolation, this time without affecting our loss since the weights of our two subnetworks are equivalent, and afterwards set to zero and to zero for . This again does not affect our loss, since the rightmost columns of are now zero, meaning that the bottom rows of have no effect on our network’s output.

Following these steps, we will have for and . And so we are now free to set the bottom rows of to zero without affecting our loss, thus arriving at .
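The convexity step in the inductive argument can be illustrated with a small numerical sketch. It assumes that the interpolated matrix is the final linear layer, that the two subnetworks use disjoint hidden units, and that the loss is the squared error; all names (`h_left`, `h_right`, `L`, `R`, `output_at`) are hypothetical.

```python
def relu(v):
    return [max(0.0, a) for a in v]

def matvec(M, v):
    return [sum(m * a for m, a in zip(row, v)) for row in M]

x = [1.0, 0.5]
y = [0.0]  # hypothetical regression target

# Hidden activations of two subnetworks that share no units
# (this is what the block-diagonal structure guarantees).
h_left = relu(matvec([[1.0, -1.0], [0.5, 2.0]], x))
h_right = relu(matvec([[2.0, -1.0], [-1.0, 1.0]], x))

# Final-layer column blocks reading from each subnetwork.
L = [[1.0, 0.5]]
R = [[-0.5, 2.0]]

def output_at(t):
    """Output when the final layer is [(1 - t) * L | t * R]."""
    left = matvec(L, h_left)
    right = matvec(R, h_right)
    return [(1.0 - t) * a + t * b for a, b in zip(left, right)]

def sq_loss(out, target):
    return sum((o - t) ** 2 for o, t in zip(out, target))

losses = [sq_loss(output_at(s / 10.0), y) for s in range(11)]
bound = max(losses[0], losses[-1])

# A convex loss along a segment of outputs never exceeds the worse endpoint.
assert all(l <= bound + 1e-12 for l in losses)
```

Because the output is an affine function of the interpolation parameter, any loss that is convex in the network output satisfies the same bound.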

###### Lemma 4.

Let be a parameter such that at least of the units in each hidden layer have been set to zero. Then we can achieve an arbitrary permutation of the non-zero hidden units of via a path consisting of just 5 line segments such that our loss is constant along this path.

###### Proof.

Let be some permutation over the units in layer . Without loss of generality, suppose all non-zero units in layer are indexed between and , and define as any one-to-one mapping such that if . Note that when we refer to a unit as “set to zero”, we mean that both row of and column of have been set to zero.

To permute the units of layer , we can first simultaneously copy the non-zero rows of into a subset of the rows that have been set to zero. Specifically, for we can copy row of into row via interpolation and without affecting our loss, due to the fact that column in is set to zero. We can then set column of to zero while copying its value to column , again via interpolation and without affecting our loss since rows and of are now equivalent.

Following these first two steps, the first columns of will have been set to zero. Thus, for all such that we can copy row of into row without affecting our loss. We can then set column of to zero while copying its value into column via interpolation and without affecting our loss since rows and of are now equivalent. Setting row to zero—again for all such that —completes the permutation for layer .

Note that because we leave the output of layer unchanged throughout the course of permuting the units of layer , it follows that we can perform all swaps across all layers simultaneously. And so from the fact that permuting each layer can be done in 5 steps—each of which consists of a single line segment in parameter space—the main result immediately follows. ∎
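The five-segment construction in the proof of Lemma 4 can be traced on a toy example. This is a sketch under stated assumptions: a single hidden layer, a bias-free ReLU network, four hidden units of which the last two are zeroed, and a permutation that swaps the two non-zero units. The helper names (`copy_rows`, `move_cols`, `zero_rows`, `lerp`) and the mapping `spare` (playing the role of the one-to-one mapping into the zeroed units) are hypothetical.

```python
def relu(v):
    return [max(0.0, a) for a in v]

def matvec(M, v):
    return [sum(m * a for m, a in zip(row, v)) for row in M]

def forward(params, x):
    W1, W2 = params
    return matvec(W2, relu(matvec(W1, x)))

def lerp(P, Q, t):
    """Point at fraction t on the line segment from parameters P to Q."""
    return tuple([[(1.0 - t) * a + t * b for a, b in zip(rp, rq)]
                  for rp, rq in zip(Mp, Mq)] for Mp, Mq in zip(P, Q))

def copy_rows(W, src_to_dst):
    V = [list(r) for r in W]
    for s, d in src_to_dst.items():
        V[d] = list(W[s])
    return V

def move_cols(W, src_to_dst):
    # Sources and destinations are disjoint in every use below.
    V = [list(r) for r in W]
    for old, new in zip(W, V):
        for s, d in src_to_dst.items():
            new[d] = old[s]
            new[s] = 0.0
    return V

def zero_rows(W, rows):
    return [[0.0] * len(r) if i in rows else list(r) for i, r in enumerate(W)]

# Units 0 and 1 are non-zero; units 2 and 3 have been set to zero.
W1 = [[1.0, -2.0], [0.5, 1.5], [0.0, 0.0], [0.0, 0.0]]
W2 = [[2.0, -1.0, 0.0, 0.0], [0.5, 3.0, 0.0, 0.0]]
pi = {0: 1, 1: 0}            # target permutation of the non-zero units
spare = {0: 2, 1: 3}         # each non-zero unit gets its own zeroed unit
back = {spare[i]: pi[i] for i in pi}

P0 = (W1, W2)
P1 = (copy_rows(P0[0], spare), P0[1])          # 1: copy rows into spare units
P2 = (P1[0], move_cols(P1[1], spare))          # 2: shift outgoing columns
P3 = (copy_rows(P2[0], back), P2[1])           # 3: copy rows to permuted slots
P4 = (P3[0], move_cols(P3[1], back))           # 4: shift columns back
P5 = (zero_rows(P4[0], set(spare.values())), P4[1])  # 5: clear spare rows

waypoints = [P0, P1, P2, P3, P4, P5]
inputs = [[0.7, 1.3], [1.0, -0.5], [2.0, 0.1]]

max_dev = 0.0
for x in inputs:
    base = forward(P0, x)
    for P, Q in zip(waypoints, waypoints[1:]):
        for s in range(11):
            out = forward(lerp(P, Q, s / 10.0), x)
            max_dev = max(max_dev, max(abs(a - b) for a, b in zip(out, base)))

# The output (and hence any loss) is unchanged along all five segments.
assert max_dev < 1e-9
```

At the end of the path, row 1 of `P5[0]` holds the original row 0 and column 1 of `P5[1]` holds the original column 0 (and vice versa), which is the swap the permutation asked for.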

Proof of Lemma 2. Without loss of generality, suppose for that the subset of non-zero hidden units in each layer are all indexed between and . Note that when we refer to a unit as “set to zero”, we mean that both the corresponding row of and column of have been set to zero. Adopting our notation in Lemma 1, we can construct a path from to as follows.

First, from the fact that the second half of units in each hidden layer have been set to zero in we have that , for , and . Similarly, half the rows of are zero, half the rows and columns of are zero for , and half the columns of are zero. Note that the indices of the non-zero units in may intersect with those of the non-zero units in . For , let denote the submatrix of corresponding to the non-zero rows and columns of .

Because are block-diagonal for and the rightmost columns of are zero, starting from we can modify the bottom rows of for any way we’d like without affecting our loss—as done in our path construction for Lemma 1. In particular, let us set to for and to . Then, from the fact that our loss function is convex over it follows that we can set to via interpolation while keeping our loss below . Finally, from the fact that the leftmost columns of are now zero and are still block-diagonal for , it follows that we can set to zero for without affecting our loss—thus making equal to for and equal to .

To complete our path from to we now simply need to permute the units of each hidden layer so as to return the elements of to their original positions in for each . From Lemma 4 it follows that we can accomplish this permutation via 5 line segments in parameter space without affecting our loss. Combined with the previous steps above, we have constructed a path from to consisting of a total of 8 line segments whose loss is bounded by .

Proof of Theorem 1. First, from Lemma 1 we know we can construct paths from both to and to while keeping our loss below and respectively. From Lemma 2 we know that we can construct a path from to such that the loss along the path is bounded by . The main result then follows from the fact that and due to and both being -dropout stable.

## Appendix B Proofs for connectivity via noise stability

In this section, we give detailed proofs showing that noise stability implies connectivity. In the following lemma, we first show that the network output is stable if we randomly drop out columns in a single layer using Algorithm 1.

###### Lemma 5.

For any layer , let be a set of matrix/vector pairs of size where and satisfying . Given , let be the output of Algorithm 1 with dropout probability . Assume for Given any , let , with probability at least , we have for any that . Further assuming , we know with probability at least no less than fraction of columns in are zero vectors.

Intuitively, this lemma upper-bounds the change in the network output after dropping out a single layer. In the lemma, we should think of as the input to the current layer, as the layer matrix and as the Jacobian of the network output with respect to the layer output. If the activation pattern does not change after dropping out, is exactly the output of the dropped-out network and is the change in the network output.

Proof of Lemma 5. Fixing and one pair , we show with probability at least , . Let be the -th column of . Then by definition of in the algorithm, we know

$$U(\hat{A}_i - A_i)x \;=\; \sum_{k,j} U_k [A_i]_{kj}\, x_j (\delta_j - 1) \;=\; \sum_j \Big(\sum_k U_k [A_i]_{kj}\Big) x_j (\delta_j - 1),$$

where is an i.i.d. Bernoulli random variable which takes the value with probability and takes the value with probability .
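The column-wise decomposition above can be verified numerically. The sketch below assumes that the dropped-out matrix is obtained by scaling column j of the layer matrix by the factor δ_j, which is how the displayed identity reads; the concrete matrices and the values in `delta` are arbitrary illustrative choices, not the ones prescribed by Algorithm 1.

```python
def matvec(M, v):
    return [sum(m * a for m, a in zip(row, v)) for row in M]

# Hypothetical small instance: U is 2 x 3, A is 3 x 4, x has length 4.
U = [[1.0, -2.0, 0.5], [0.0, 1.0, 3.0]]
A = [[1.0, 2.0, -1.0, 0.5], [0.0, -1.5, 2.0, 1.0], [3.0, 0.5, 0.0, -2.0]]
x = [0.5, -1.0, 2.0, 1.5]
delta = [0.0, 2.0, 2.0, 0.0]  # arbitrary per-column scaling factors

# Assumed form of the dropped-out matrix: column j scaled by delta[j].
A_hat = [[a * d for a, d in zip(row, delta)] for row in A]
diff = [[ah - a for ah, a in zip(rh, r)] for rh, r in zip(A_hat, A)]

# Left-hand side: U (A_hat - A) x.
lhs = matvec(U, matvec(diff, x))

# Right-hand side: sum over columns j of (U [A]_j) x_j (delta_j - 1).
rhs = [0.0] * len(U)
for j in range(len(x)):
    col_j = [row[j] for row in A]
    U_col = matvec(U, col_j)
    for r in range(len(rhs)):
        rhs[r] += U_col[r] * x[j] * (delta[j] - 1.0)

assert all(abs(a - b) < 1e-12 for a, b in zip(lhs, rhs))
```

Writing the difference as a sum of independent per-column terms is what allows a matrix concentration bound to be applied to it.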

Let be the -th column of . Because , (any bounded away from 1 will work). Hence the norm for each individual term can be bounded as follows.

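$$\Big\|\Big(\sum_k U_k [A_i]_{kj}\Big) x_j (\delta_j - 1)\Big\| \;\overset{(*)}{\le}\; O\Big(\frac{\|x\|}{\sqrt{h_{i-1}}}\Big)\,\big\|U[A_i]_j\big\| \;\le\; O\Big(\frac{\|x\|}{\sqrt{h_{\min}}}\Big)\,\|U\|\,\big\|[A_i]_j\big\| \;\overset{(\dagger)}{\le}\; O\Big(\frac{\sqrt{p}\,\|U\|\,\|A_i\|_F\,\|x\|}{\sqrt{h_{\min}}}\Big),$$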

where (*) uses the assumption that and (†) holds because we assume for

For the total variance, we have

 σ2: =∑jE⎡⎣∥∥ ∥∥(∑kUk[Ai]kj)xj(δj−1)∥