Neural Teleportation

12/02/2020 ∙ by Marco Armenta, et al. ∙ Université de Sherbrooke, Université de Bourgogne

In this paper, we explore a process called neural teleportation, a mathematical consequence of applying quiver representation theory to neural networks. Neural teleportation "teleports" a network to a new position in the weight space, while leaving its function unchanged. This concept generalizes the notion of positive scale invariance of ReLU networks to any network with any activation functions and any architecture. In this paper, we shed light on surprising and counter-intuitive consequences neural teleportation has on the loss landscape. In particular, we show that teleportation can be used to explore loss level curves, that it changes the loss landscape, sharpens global minima and boosts back-propagated gradients. From these observations, we demonstrate that teleportation accelerates training when used during initialization regardless of the model, its activation function, the loss function, and the training data. Our results can be reproduced with the code available here: https://github.com/vitalab/neuralteleportation.


1 Introduction

Despite years of research, our theoretical understanding of neural networks, their loss landscape and their behavior during training and testing is still limited. A theoretical analysis of neural networks using quiver representation theory [1] was recently introduced, exposing the algebraic and combinatorial nature of neural networks. Among other things, the authors present a by-product of representing neural networks through the lens of quiver representation theory: the notion of neural teleportation.

As will be explained later, neural teleportation is the mathematical consequence of applying isomorphisms of quiver representations to neural networks. This process has the unique property of changing the weights and the activation functions of a network while, at the same time, preserving its function, i.e., a teleported network makes the same predictions for the same input values as the original, non-teleported network.

Isomorphisms of quiver representations have already been used on neural networks, often unbeknownst to the authors, through the concept of positive scale invariance (also called positive homogeneity) of ReLU networks [5, 7, 13, 14]. This concept derives from the fact that one can choose a positive number $c$ at each hidden neuron of a ReLU network, multiply every incoming weight of that neuron by $c$, divide every outgoing weight by $c$, and still obtain the same network function. Positive homogeneity in previous works is restricted to positive scale invariant activation functions and to positive scaling factors. Models like maxout networks [8], leaky rectifiers [9] and deep linear networks [15] are also positive scale invariant.

Figure 1: (a) Network quiver as introduced in [1]. (b) Neural network based on this quiver, with weights $w$ and activation functions $f$. Input, hidden, bias, and output neurons are in yellow, gray, orange and red, respectively. (c) Same neural network but with a change of basis (CoB) $\tau_a$ at each neuron $a$. (d) Neural teleportation of the weights attached to a neuron $b$. The resulting activation function is $f_b^\tau(x) = \tau_b f_b(x/\tau_b)$.

Positive scale invariance is often seen as some sort of trick to change the weights of a ReLU network without affecting the network function. In this paper, we show that neural teleportation is more than a trick, as it has concrete consequences on the loss landscape and the network optimization. We account for various theoretical and practical consequences that teleportation has on neural networks.

Our findings can be summarized as follows:

  1. Neural teleportation can be used to explore loss level curves;

  2. Micro-teleportation vectors have the surprising property of being rigorously perpendicular to back-propagated gradients computed with any kind and amount of labeled data, even random data with random labels;

  3. Neural teleportation changes the flatness of the loss landscape;

  4. The back-propagated gradients of a teleported network scale with the magnitude of the teleportation;

  5. Randomly teleporting a network before training speeds up convergence. This is valid for any dataset, loss function, activation function, and architecture.

2 Neural teleportation

In this section, we explain what neural teleportation is and the implications it has on the loss landscape and the back-propagated gradients. For more details on the theoretical interpretation of neural networks according to quiver representation theory, please refer to [1].

2.1 Isomorphisms and Change of Basis (CoB)

Neural networks are often pictured as oriented graphs. The authors in [1] show that neural network graphs are a specific kind of quiver with a loop at every node. They call these graphs network quivers (cf. figure 1(a)). They also mention that neural networks, as they are generally defined, are network quivers with a weight assigned to every edge and an activation function at every loop (cf. figure 1(b)).

According to representation theory, two quiver representations are equivalent if an isomorphism exists between them. Since neural networks are a type of quiver representation, isomorphisms of quiver representations also apply to them.

Isomorphisms are given by sets of non-zero real numbers subject to some conditions. Such a set of non-zero numbers is called a change of basis (CoB) [3], where each node of the quiver is assigned a non-zero number. In order to apply an isomorphism to a neural network, each neuron $a$ must be assigned a CoB value, represented by $\tau_a$ in figure 1(c).

The process of applying an isomorphism to a neural net is called neural teleportation. For a teleportation to be valid, the CoB must comply with the following five conditions:

  1. The CoB of every input, output and bias neuron must be equal to $1$ (i.e., $\tau = 1$ in figure 1(c)).

  2. Neurons $a$ and $b$ connected by a residual connection must share the same CoB: $\tau_a = \tau_b$.

  3. For convolutional layers, the neurons of a given feature map should share the same CoB.

  4. For batch norm layers, a CoB must be assigned to the parameters $\gamma$ and $\beta$, but not to the mean and variance.

  5. The CoBs of a dense (concatenation) connection are obtained by concatenating those of its input layers.

Condition 1 is the isomorphism condition. Any two networks related by a CoB satisfying this condition are said to be isomorphic. This leads to the following theorem.

Theorem 2.1

Isomorphic neural networks have the same network function.

Said otherwise, despite having different weights and different activation functions, isomorphic neural networks return exactly the same predictions given the same inputs. It also means that they have exactly the same loss value (cf. [1] for the proof).

Conditions 2 to 5 have to do with the architecture of the network. They ensure that the produced isomorphic networks share the same architecture (again, cf. [1] for more details): the teleportation of a residual connection remains a residual connection (condition 2), the teleportation of a conv layer remains a conv layer (condition 3), the teleportation of a batch norm remains a batch norm (condition 4), and the teleportation of a dense layer remains a dense layer (condition 5). Please note that Meng et al. [13] introduced a concept similar to condition 3.
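As an example of condition 3, for a conv layer the CoB is drawn per feature map rather than per weight, and broadcast over the kernel's spatial dimensions. Here is a minimal NumPy sketch of that idea (ours, not the authors' code; it applies the weight rule of Eq. (1) below channelwise):

```python
import numpy as np

rng = np.random.default_rng(0)

# A conv kernel of shape (out_channels, in_channels, k, k). Condition 3:
# one CoB value per feature map, shared by all weights of that map.
kernel = rng.normal(size=(8, 4, 3, 3))
tau_in = rng.uniform(0.5, 1.5, size=4)    # CoB of the input feature maps
tau_out = rng.uniform(0.5, 1.5, size=8)   # CoB of the output feature maps

# Each weight is multiplied by its target CoB and divided by its source CoB.
teleported = kernel * tau_out[:, None, None, None] / tau_in[None, :, None, None]
```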

2.2 Teleporting a neural network

Neural teleportation is a process by which the weights $W$ and the activation functions $f$ of a neural network are converted into a new set of weights $V$ and a new set of activation functions $f^\tau$. From a practical standpoint, this process is simple and is illustrated in figure 1(d).

Considering the weight $w_{ab}$ of the connection from neuron $a$ to neuron $b$, and $\tau_a$ (resp. $\tau_b$) the CoB of neuron $a$ (resp. $b$), the teleportation of that weight is simply:

$$w_{ab}^\tau = \frac{\tau_b}{\tau_a}\, w_{ab} \qquad (1)$$

To teleport an entire network, this operation is carried out for every weight of the network. Note that positive homogeneity [5, 7, 13, 14] implies a similar operation but with the restriction of positive scaling factors. In the case of batch norm layers, the parameter $\gamma$ is treated like a weight between two hidden neurons and $\beta$ as a weight starting from a bias neuron. As such, $\gamma$ and $\beta$ are teleported like any other weight using Eq. (1).

Neural teleportation also applies to activation functions. If $f_b$ is the activation function of neuron $b$ in figure 1(d), then the teleported activation is

$$f_b^\tau(x) = \tau_b\, f_b\!\left(\frac{x}{\tau_b}\right) \qquad (2)$$

This is a critical operation to make sure the pre- and post-teleportation networks have the same function. We can see that if $\tau_b > 0$ and $f_b$ is positive scale invariant (like ReLU), then $f_b^\tau = f_b$. Neural teleportation is thus a generalization of the concept of positive scale invariance [5, 7, 13, 14]. Cf. [1] for the proof that neural teleportation works for any activation function.
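To make Eqs. (1) and (2) concrete, here is a minimal NumPy sketch (ours, not the authors' code; all names are illustrative) that teleports a small ReLU MLP with positive CoB values and checks that the network function is preserved:

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda x: np.maximum(0.0, x)

# A small MLP: 3 inputs -> 5 hidden -> 4 hidden -> 2 outputs, no biases.
W = [rng.normal(size=(5, 3)), rng.normal(size=(4, 5)), rng.normal(size=(2, 4))]

def forward(weights, x):
    h = x
    for Wl in weights[:-1]:
        h = relu(Wl @ h)
    return weights[-1] @ h  # linear output layer

# CoB: 1 on input/output neurons (condition 1), positive values on hidden
# neurons, so ReLU stays unchanged under Eq. (2).
tau = [np.ones(3), rng.uniform(0.5, 1.5, 5), rng.uniform(0.5, 1.5, 4), np.ones(2)]

# Eq. (1): the weight from neuron a to neuron b becomes (tau_b / tau_a) * w,
# i.e., rows scaled by the target CoB, columns by the inverse source CoB.
V = [np.diag(tau[l + 1]) @ W[l] @ np.diag(1.0 / tau[l]) for l in range(len(W))]

x = rng.normal(size=3)
print(np.allclose(forward(W, x), forward(V, x)))  # True: same network function
```

With a negative CoB, this check would only pass if the ReLU were also replaced by its teleported version from Eq. (2).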

2.3 Feature maps

To illustrate more concretely the effect that neural teleportation has on a neural network, we trained (on MNIST) and teleported two ResNet18 models: one with ReLU activations and one with tanh activations. Feature maps of the original and teleported trained networks are shown in figure 2. As one can see, the feature maps before and after teleportation are, in both cases, very different. This underlines the fact that while teleportation preserves the network function, it changes the features that the network has learned.

Figure 2: (a) MNIST images. (b) Feature maps of a trained ReLU ResNet18. (c) Same feature maps after teleportation. (d) Feature maps of a trained tanh ResNet18. (e) Same feature maps after teleportation.

3 Previous work

Concepts similar to neural teleportation have been proposed in the literature, the notion of positive scale invariance being the closest. It was shown that positive scale invariance affects training by inducing symmetries on the weight space, and many methods have tried to take advantage of it [4, 5, 10, 14]. Our notion of teleportation gives a different perspective as i) it allows any non-zero real value to be used as a CoB, ii) it acts on any kind of activation function, and iii) it does not impose any constraints on the structure of the network nor on the data it processes. Also, neural teleportation does not require new update rules, as it naturally adapts to existing gradient descent algorithms.

In [13, 14], the authors account for the fact that ReLU networks can be represented by the values of "basis paths" connecting input neurons to output neurons. Meng et al. [13] made clear that these paths are interdependent and proposed an optimization method based on this observation. They designed a space that manages these dependencies, proposing an update rule for the values of the paths that are later converted back into network weights. Unfortunately, their method only works for neural nets having the same number of neurons in each hidden layer. Furthermore, they guarantee no invariance across residual connections. This is unlike neural teleportation, which works for any network architecture, including residual networks.

Scale invariance in networks with batch normalization has been observed [2, 6], but not in the sense of quiver representations.

Positive scale invariance of ReLU networks has also been used to prove that common flatness measures can be manipulated by the re-scaling factors [7]. Here, we experimentally show that the loss landscape changes when teleporting a network with positive or negative CoB, regardless of its architecture and activation functions.

4 Neural teleportation and the loss landscape

Despite its apparent simplicity, neural teleportation has surprising consequences on the loss landscape. In this section, we underline such consequences and provide empirical evidence for them.

Figure 3: (a) 2D slice of a loss landscape, with the dot W marking the location of a given network. The V dots are two teleported versions of W inside the same landscape, while the MV dot stands for a micro-teleportation of W. (b) The dots are teleported versions of W in a different loss landscape. Since neural teleportation preserves the network function, the six networks have rigorously the same loss value.

4.1 Inter- and intra-loss landscape teleportation

As mentioned before, teleporting a positive scale invariant activation function with a positive CoB results in $f^\tau = f$. This means that the teleported network ends up inside the same loss landscape, but with weights $V$ at a different location than $W$ (cf. figure 3(a)). In other cases (a negative CoB, or an activation function that is not positive scale invariant), neural teleportation changes the activation functions and thus the overall loss landscape. For example, with $\tau = -1$ and a ReLU function $f(x) = \max(0, x)$, the teleportation of $f$ becomes $f^\tau(x) = -\max(0, -x) = \min(0, x)$.

Thus, the way CoB values are chosen has a concrete effect on how a network is teleported. A trivial case is when $\tau_a = 1$ for every hidden neuron $a$, which leads to no transformation at all: $V = W$ and $f^\tau = f$. For our method, we choose the CoB values by randomly sampling one of two distributions. The first one is a uniform distribution centered on $1$: $\tau \sim U(1-\sigma, 1+\sigma)$ with $0 < \sigma < 1$. We call $\sigma$ the CoB-range; the larger $\sigma$ is, the more the teleported network weights differ from $W$. Also, when this sampling operation is combined with positive scale invariant activation functions (like ReLU), the teleported activation functions stay unchanged ($f^\tau = f$), and thus the new weights are guaranteed to stay within the same landscape as the original set of weights (as in figure 3(a)). We thus call this operation an intra-landscape neural teleportation.

The other distribution is a mixture of two equal-sized uniform distributions, one centered at $-1$ and the other at $1$:

$$\tau \sim \tfrac{1}{2}\, U(-1-\sigma, -1+\sigma) + \tfrac{1}{2}\, U(1-\sigma, 1+\sigma).$$

With high probability, a network teleported with this sampling will end up in a new loss landscape, as illustrated in figure 3(b). We thus call this operation an inter-landscape neural teleportation.
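The two sampling schemes can be summarized in a few lines of Python (an illustrative helper, not taken from the authors' repository):

```python
import numpy as np

def sample_cob(n, sigma, kind="intra", rng=np.random.default_rng()):
    """Sample n CoB values with CoB-range 0 < sigma < 1.

    'intra': U(1-sigma, 1+sigma); with ReLU-like activations the teleported
             network stays in the same loss landscape.
    'inter': even mixture of U(-1-sigma, -1+sigma) and U(1-sigma, 1+sigma);
             the teleported network usually lands in a new loss landscape.
    """
    tau = rng.uniform(1.0 - sigma, 1.0 + sigma, size=n)
    if kind == "inter":
        tau *= rng.choice([-1.0, 1.0], size=n)  # pick a mixture component
    return tau
```

Note that both schemes keep every CoB value away from zero, as required for the CoB to define an isomorphism.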

4.2 Loss level curves

Since the CoB can take any non-zero real values, a network can be teleported an infinite number of times to an infinite number of locations, within the same landscape or across different landscapes. This is illustrated in figure 3, where W is teleported to 5 different locations in two different loss landscapes. Because of the very nature of neural teleportation, which preserves the network function, these 6 neural networks have the same loss value. Thus, networks teleported within the same landscape sit on the same loss level curve.

We validated this assertion by teleporting an MLP 100 times with an inter-landscape sampling and a large CoB-range. While the mean absolute difference between the original weights and the teleported weights is large compared to the magnitude of the original weights, the average loss difference was no more than a floating-point error.

4.3 Micro teleportation

One can easily prove that the gradient of a function is always perpendicular to its local level curve (cf. [16], chap. 2). However, back-propagation computes noisy gradients that depend on the amount of data used. Thus, the noisy back-propagated gradients do not a priori have to be strictly perpendicular to the local loss level curves. This may give rise to the legitimate concern of whether teleported networks have more erratic, and thus error-prone, gradients.

This concern can be (partly) addressed via the notion of micro-teleportation, which derives from the previous subsection. Let's consider the intra-landscape teleportation of a network with positive scale invariant activation functions (like ReLU) with a CoB-range $\sigma$ close to zero. In that case, $\tau_a \approx 1$ for every neuron $a$, and the teleported weights $V$ (computed following Eq. (1)) end up being very close to $W$. We call this a micro-teleportation and illustrate it in figure 3(a) (the MV dot illustrates a micro-teleportation of W).

Because $W$ and $V$ are isomorphic, they both lie on the same loss level curve. Thus, if $\sigma$ is small enough, the vector $V - W$ between $W$ and $V$ is locally co-linear to the local loss level curve. We call it a micro-teleportation vector.

Figure 4: [Top] Angle histograms between micro-teleportation vectors and back-propagated gradients for VGGnet on CIFAR-10 data and MLP on random data. The other rows are angle histograms between micro-teleportation vectors and random vectors, between gradients and random vectors, and between pairs of random vectors.

A rather counter-intuitive empirical property of micro-teleportations is that micro-teleportation vectors are perpendicular to back-propagated gradients. This surprising observation leads to the following conjecture.

Figure 5: (a) Loss/accuracy profiles obtained by linearly interpolating between two optimized MLPs $A$ and $B$. Network $A$ is at $\alpha = 0$ and $B$ at $\alpha = 1$ (green vertical lines). Dotted lines are for training and solid lines for validation. The remaining plots are similar interpolations, but between teleported versions of $A$ and $B$ with increasing CoB-ranges (b), (c), and (d).
Conjecture 4.1

For any neural network, any dataset and any loss function, there exists a sufficiently small CoB-range $\sigma$ such that every micro-teleportation produced with it is perpendicular to the back-propagated gradient.

We empirically assessed this conjecture by computing the angle between micro-teleportation vectors and back-propagated gradients of four models (MLP, VGG, ResNet and DenseNet) on three datasets (CIFAR-10, CIFAR-100 and random data), with different batch sizes (8 and 64) and a small CoB-range. A cross-entropy loss was used for all models. Figure 4 shows angular histograms for VGG on CIFAR-10 and an MLP (with one hidden layer of 128 neurons) on random data (results for the other configurations are in the appendix). The first row shows the angular histogram between micro-teleportation vectors and gradients. We used batch sizes of 8 for the MLP and 64 for VGG. As can be seen, both histograms exhibit a clear Dirac delta at the 90° angle. As a means of comparison, we report angular histograms between micro-teleportation vectors and random vectors, between back-propagated gradients and random vectors, and between pairs of random vectors. While random vectors in a high-dimensional space are known to be quasi-orthogonal [11], by no means are they exactly orthogonal, as shown by the green histograms. The Dirac delta of the first row versus the wider green distributions is a clear illustration of our conjecture.

These empirical findings suggest that although back-propagation computes noisy gradients, their orientation in the weight space does not entirely depend on the data nor the loss function: they remain perpendicular to micro-teleportation vectors.
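As a quick sanity check of this conjecture, the following PyTorch sketch (ours; the model and batch are arbitrary) measures the angle between a micro-teleportation vector and a back-propagated gradient on random data with random labels:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(10, 32, bias=False), nn.ReLU(),
                    nn.Linear(32, 4, bias=False))

# Back-propagated gradient on random data with random labels.
x, y = torch.randn(16, 10), torch.randint(0, 4, (16,))
loss = nn.functional.cross_entropy(net(x), y)
grad = torch.cat([g.flatten()
                  for g in torch.autograd.grad(loss, tuple(net.parameters()))])

# Micro-teleportation: CoB ~ U(1-sigma, 1+sigma) with a tiny sigma on the
# hidden neurons; input and output CoB values are fixed to 1 (condition 1).
sigma = 1e-3
tau = 1.0 + sigma * (2 * torch.rand(32) - 1)
W1, W2 = net[0].weight, net[2].weight
V1, V2 = tau[:, None] * W1, W2 / tau[None, :]         # Eq. (1)
mv = torch.cat([(V1 - W1).flatten(), (V2 - W2).flatten()])

cos = torch.dot(mv, grad) / (mv.norm() * grad.norm())
print(torch.rad2deg(torch.acos(cos.clamp(-1, 1))))    # ~90 degrees
```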

4.4 Teleportation and landscape flatness

It has been shown in [7] that, for positive scale invariant activation functions, one can find a CoB (called a reparametrization there) so that the most commonly used measures of flatness can be manipulated to obtain a sharper minimum with the exact same network function. We empirically show that neural teleportation systematically sharpens the found minima, independently of the architecture and the activation functions.

A commonly used method to compare the flatness of two trained networks is to plot the 1D loss/accuracy curves along the interpolation between the two sets of weights [7]. It is also well known that small batch sizes produce flatter minima than bigger batch sizes. We trained a 5-hidden-layer MLP on CIFAR-10 two times: first with a small batch size (network $A$), then with a larger one (network $B$). Then, as in [12], we plotted the 1D loss/accuracy curves along the interpolation between the two weight vectors of the networks (cf. figure 5(a)). We then performed the same interpolation, but between teleportations of $A$ and $B$ with increasing CoB-ranges. As can be seen from figure 5, the landscape becomes sharper as the CoB-range increases. Said otherwise, a larger teleportation leads to a locally-sharper landscape. More experiments with other models can be found in the appendix.
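The interpolation itself is straightforward; here is a minimal sketch of the procedure of [12] (illustrative, with a generic `loss_fn` over flattened weight vectors assumed):

```python
import numpy as np

def interpolation_profile(wA, wB, loss_fn, n=25):
    """Evaluate the loss along the line segment between two flattened
    weight vectors wA and wB (alpha = 0 gives A, alpha = 1 gives B)."""
    alphas = np.linspace(0.0, 1.0, n)
    return alphas, [loss_fn((1 - a) * wA + a * wB) for a in alphas]
```

Sharper minima show up as narrower, deeper valleys in the resulting profile.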

5 Optimization

In the previous section, we showed that teleportation moves a network along a loss level curve and sharpens the local loss landscape. In this section, we show that any teleportation increases the magnitude of the local gradient and accelerates the convergence rate when used to initialize the weights of the network.

Figure 6: Validation accuracies for MLP, VGG, ResNet and DenseNet on CIFAR-10 (top row) and CIFAR-100 (bottom row). The boxes are produced with five runs for every learning rate (0.01, 0.001 and 0.0001). Note that the MLP on CIFAR-100 does not seem to move towards convergence without teleportation.
Figure 7: Mean absolute difference (± std-dev) between the back-propagated gradient magnitudes of teleported networks and their original, non-teleported network. Larger CoB-ranges generate larger gradients. Results are for CIFAR-10. Plots for these same models on CIFAR-100 can be found in the appendix.

5.1 Back-propagated gradients of a teleportation

It has already been noticed [7, 14] that under positive scaling of ReLU networks, the gradient scales inversely to the weights with respect to the CoB. Here, we show that the back-propagated gradient of a teleported network has the same property, regardless of the architecture of the network, the data, the loss function and the activation functions.

Let $(W, f)$ be a neural network with a set of weights $W$ and activation functions $f$. We denote by $W_\ell$ the weight matrix of the $\ell$-th layer of the network, and analogously $V_\ell$ for the teleportation $(V, f^\tau)$. The CoB at layer $\ell$ is denoted $\tau_\ell$, which is a column vector of non-zero real numbers. Let's also consider a data sample $(x, y)$ with input data $x$ and target value $y$, and the gradients $\nabla_{W_\ell}\mathcal{L}$ and $\nabla_{V_\ell}\mathcal{L}$ of the networks $(W, f)$ and $(V, f^\tau)$ with respect to $(x, y)$. Following Eq. (1), we have that

$$V_\ell = \tau_{\ell-1}^{-1} \odot W_\ell \odot \tau_\ell \qquad (3)$$

where the operation $\tau_{\ell-1}^{-1} \odot$ multiplies the columns of the matrix $W_\ell$ by the coordinate values of the vector $\tau_{\ell-1}^{-1}$, while the operation $\odot\, \tau_\ell$ multiplies the rows of the matrix by the coordinate values of the vector $\tau_\ell$.

Figure 8: Validation plots produced over five runs over three learning rates (0.01, 0.001 and 0.0001) of MLPs with LeakyReLU (first column), Tanh (second column) and ELU (third column) activations, on the MNIST (top row) and CIFAR-10 (bottom row) datasets.
Theorem 5.1

Let $(V, f^\tau)$ be a teleportation of the neural network $(W, f)$ with respect to the CoB $\tau$. Then

$$\nabla_{V_\ell}\mathcal{L} = \tau_{\ell-1} \odot \nabla_{W_\ell}\mathcal{L} \odot \tau_\ell^{-1}$$

for every layer $\ell$ of the network (proof in the appendix).
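Theorem 5.1 is easy to verify numerically. The sketch below (ours, with arbitrary sizes) teleports a two-layer tanh network, including its activation as per Eq. (2), and checks the claimed gradient scaling:

```python
import torch

torch.manual_seed(0)
x, y = torch.randn(4), torch.tensor([1])
W1 = torch.randn(6, 4, requires_grad=True)   # input -> hidden
W2 = torch.randn(3, 6, requires_grad=True)   # hidden -> output

# Inter-landscape CoB on the hidden layer; input/output CoB = 1.
tau = torch.empty(6).uniform_(0.5, 1.5) * (2 * torch.randint(0, 2, (6,)) - 1)

# Original network and its gradients.
loss = torch.nn.functional.cross_entropy((W2 @ torch.tanh(W1 @ x)).unsqueeze(0), y)
gW1, gW2 = torch.autograd.grad(loss, (W1, W2))

# Teleported network: weights via Eq. (1), activation via Eq. (2).
V1, V2 = tau[:, None] * W1, W2 / tau[None, :]
h = tau * torch.tanh((V1 @ x) / tau)
loss_t = torch.nn.functional.cross_entropy((V2 @ h).unsqueeze(0), y)
gV1, gV2 = torch.autograd.grad(loss_t, (V1, V2))

print(torch.allclose(gV1, gW1 / tau[:, None], atol=1e-6))  # rows scaled by 1/tau
print(torch.allclose(gV2, gW2 * tau[None, :], atol=1e-6))  # cols scaled by tau
```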

If we look at the magnitude of the teleported gradient, we have that

$$\left\|\nabla_{V_\ell}\mathcal{L}\right\|^2 = \sum_{a,b} \frac{\tau_a^2}{\tau_b^2} \left(\nabla_{W_\ell}\mathcal{L}\right)_{ba}^2.$$

We can see that the ratio $\tau_a^2/\tau_b^2$ appears multiplying the squared non-teleported gradient $\left(\nabla_{W_\ell}\mathcal{L}\right)_{ba}^2$. For an intra-landscape teleportation, each CoB value is randomly sampled from a uniform distribution $U(1-\sigma, 1+\sigma)$ for $0 < \sigma < 1$. Since $\tau_a$ and $\tau_b$ are independent random variables, the mathematical expectation of this squared ratio is

$$E\!\left[\frac{\tau_a^2}{\tau_b^2}\right] = E\!\left[\tau_a^2\right] E\!\left[\frac{1}{\tau_b^2}\right] = \frac{1 + \sigma^2/3}{1 - \sigma^2}.$$

Thus, when $\sigma \to 0$ (i.e., no teleportation, as described in section 4.1), the expectation goes to $1$, which means that on average the gradients are multiplied by $1$ and thus remain unchanged. But when $\sigma \to 1$, the expectation diverges, which means that the gradient magnitudes get multiplied by an increasingly large factor.
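The closed form above follows from $E[\tau^2] = 1 + \sigma^2/3$ and $E[1/\tau^2] = 1/(1-\sigma^2)$ for $\tau \sim U(1-\sigma, 1+\sigma)$, and is easy to confirm by Monte Carlo (a quick check of ours, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
for sigma in (0.3, 0.6, 0.9):
    tau_a = rng.uniform(1 - sigma, 1 + sigma, 1_000_000)
    tau_b = rng.uniform(1 - sigma, 1 + sigma, 1_000_000)
    mc = np.mean(tau_a**2 / tau_b**2)               # Monte Carlo estimate
    closed = (1 + sigma**2 / 3) / (1 - sigma**2)    # E[tau^2] * E[1/tau^2]
    print(f"sigma={sigma}: MC={mc:.3f}, closed form={closed:.3f}")
```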

We empirically validate this proposition with four different networks in figure 7, where the CoB-range is on the x-axis and, for each value, we computed the mean absolute difference between the normalized gradient magnitudes of 20 different teleportations and that of the original network. We can see that a larger CoB-range leads to a larger gradient magnitude.

This shows that randomly teleporting a neural network increases the sharpness of the surrounding landscape and thus the magnitude of the local gradient. This is true for any network architecture and any dataset. This analysis also holds true for inter-landscape CoB sampling.


Figure 9: Validation plots produced over five runs over three learning rates (0.01, 0.001 and 0.0001) for MLP, VGG, ResNet and DenseNet with normal (top row) and Xavier (bottom row) initializations on CIFAR-10.
Figure 10: Validation plots produced over five runs over three learning rates (0.01, 0.001 and 0.0001) for MLP, VGG, ResNet and DenseNet on CIFAR-10, comparing a usual training and a training with pseudo-teleportation. It can be seen that pseudo-teleportation helps the training of the MLP, but worsens the training of the other models.

5.2 Convergence speed up

An immediate consequence of the previous proposition is that teleporting a network just after its initialization sets it in an area of the landscape where the slope is steeper, so convergence should be quicker.

In order to assess this statement, we trained four models: an MLP with 5 hidden layers of 500 neurons each, a VGGnet, a ResNet18, and a DenseNet, all with ReLU activations. Training was done on two datasets (CIFAR-10 and CIFAR-100) with two optimizers (vanilla SGD and SGD with momentum) and three different learning rates (0.01, 0.001 and 0.0001), for a total of 72 configurations. For each configuration, we trained the network with and without neural teleportation after initialization. Training was done five times for 100 epochs. An inter-landscape CoB sampling was used for all these experiments. The teleported and non-teleported networks were initialized with the same randomized operator following the "Kaiming" method.

Figure 11: Weight histogram of an MLP before teleportation [left] and after teleportation [right]. The initialization is the PyTorch default for fully-connected layers, which is uniform within a fixed range, while the teleported weights are more broadly distributed, as shown in the zooms.

This resulted in a total of 720 training curves, which we averaged across the learning rates (± std-dev boxes) and put in figure 6. The light green and red boxes are results from normal training, while the darker green and red boxes are results from the same training but with neural teleportation after initialization. As can be seen, neural teleportation accelerates the convergence of every model, on every dataset and with every optimizer.

To make sure that these results are not unique to ReLU networks, we trained the MLP on MNIST and CIFAR-10 with three different activation functions: LeakyReLU, Tanh and ELU, again with and without neural teleportation after the uniform initialization, which is the default PyTorch init mode for fully connected layers. As can be seen from figure 8, here again teleportation accelerates training across datasets and models.

In order to measure the impact of the initialization procedure, we ran a similar experiment with a basic Gaussian initialization as well as a Xavier initialization. We obtained the same pattern for the Gaussian case. For the Xavier initialization, teleportation helps the training of the VGG net, while for the other models the difference is less prominent (see figure 9).

Figure 11 shows the weight histograms of an MLP network initialized with a uniform distribution (PyTorch's default) before [left] and after [right] teleportation. As can be seen, the weight distributions are significantly different. In order to show that the convergence speed-up does not depend on the weight distribution but on the teleportation process itself, we randomly generated weights so that their distribution follows that of figure 11 [right]. We call this a pseudo-teleportation. We then trained the four models with the same regime, comparing pseudo-teleportation to no teleportation. Results in figure 10 reveal that pseudo-teleportation does not improve training.
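One plausible way to implement such a pseudo-teleportation (our guess at the procedure; the text only specifies that the target weight distribution is matched) is to resample each weight i.i.d. from the empirical distribution of a genuinely teleported network:

```python
import numpy as np

def pseudo_teleport(weights, teleported_weights, rng=np.random.default_rng()):
    """Draw new weights i.i.d. from the empirical distribution of a teleported
    network's weights: the histogram is matched, but the per-neuron structure
    that a real teleportation preserves is destroyed."""
    pool = np.concatenate([v.ravel() for v in teleported_weights])
    return [rng.choice(pool, size=w.shape) for w in weights]
```

Since pseudo-teleportation matches the histogram but not the function-preserving structure, the fact that it does not speed up training isolates the teleportation itself as the cause of the speed-up.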

6 Conclusion

In this paper, we provided empirical evidence that neural teleportation (a simple concept that generalizes the notion of positive scale invariance of ReLU networks) can project the weights (and the activation functions) of a network to an infinite number of places in the weight space while always preserving the network function. This operation has counter-intuitive consequences on the loss landscape. We showed that: (1) neural teleportation changes the feature maps learned by a network (Fig. 2); (2) it can be used to explore loss level curves; (3) micro-teleportation vectors are always perpendicular to back-propagated gradients (Fig. 4); (4) teleportation reduces the landscape flatness (Fig. 5) and increases the local gradient (Fig. 7) proportionally to the CoB-range used. All these observations hold regardless of the network architecture, the activation functions and the dataset. As a consequence, we have shown that teleporting a network right before its training accelerates gradient descent.

References

Appendix

Micro-teleportations

Following section 4.3 on micro-teleportations, we provide more angular histograms between back-propagated gradients and random micro-teleportations of the network, for the MLP, VGG, ResNet and DenseNet models on the CIFAR-10, CIFAR-100 and random datasets. Each histogram in figures 12, 13 and 14 was computed with 100 random micro-teleportations with a small CoB-range.

Teleportation and landscape flatness

Following section 4.4, we trained a VGGnet with two different batch sizes and then produced a 1D loss plot by interpolating between the two models. We then reproduced the 1D loss plots with teleported versions of the models. As shown in figure 15, neural teleportation has the effect of sharpening the loss landscape.

Figure 12: Micro-teleportations.
Figure 13: Micro-teleportations.
Figure 14: Micro-teleportations.

Back-propagated gradients of a teleportation

Here we give the proof of Theorem 5.1 and reproduce the experiment shown in figure 7 (of the main paper) for the CIFAR-100 dataset. Results are shown in figure 16.

Figure 15: (a) Loss/accuracy profiles obtained by linearly interpolating between two optimized VGGs $A$ and $B$. Network $A$ is at $\alpha = 0$ and $B$ at $\alpha = 1$ (green vertical lines). Dotted lines are for training and solid lines for validation. The remaining plots are similar interpolations, but between teleported versions of $A$ and $B$ with increasing CoB-ranges (b), (c), and (d).

Proof of Theorem 5.1

For clarity, we first introduce notation for the forward pass of a neural network with $L$ layers and no bias vertices. We fix a data sample $(x, y)$ and define the vector of activation outputs of the neurons at layer $\ell$ by $z_\ell$, and the vector of pre-activations at layer $\ell$ by $a_\ell$. We denote by $f_\ell$ the vector of activation functions at layer $\ell$. For the input layer we have $z_0 = x$. Next, we define inductively

$$a_\ell = W_\ell\, z_{\ell-1}, \qquad z_\ell = f_\ell(a_\ell) \qquad (4)$$

for every $\ell = 1, \dots, L$.

In the case of the backward pass, we denote the vector of derivatives of the activations at layer $\ell$ by $f'_\ell$, and the vector of derivatives of the loss with respect to the activation outputs of the neurons at layer $\ell$ by $\delta_\ell := \nabla_{z_\ell}\mathcal{L}$. On the output layer we have

$$\delta_L = \nabla_{z_L}\mathcal{L}(z_L, y) \qquad (5)$$

where $\mathcal{L}$ is the loss function. If we denote by $\odot$ the point-wise (Hadamard) product of vectors of the same size, then inductively from layer $L$ down to the input layer we have

$$\delta_{\ell-1} = W_\ell^T \left(\delta_\ell \odot f'_\ell(a_\ell)\right) \qquad (6)$$

and also that

$$\nabla_{W_\ell}\mathcal{L} = \left(\delta_\ell \odot f'_\ell(a_\ell)\right) z_{\ell-1}^T. \qquad (7)$$

If $(V, f^\tau)$ is an isomorphic (teleported) version of the neural network, given by a choice of change of basis at every hidden neuron, and we denote by $\tau_\ell$ the vector of change of basis values for layer $\ell$, then

$$V_\ell = \tau_{\ell-1}^{-1} \odot W_\ell \odot \tau_\ell \qquad (8)$$

and for the activation functions we have

$$f^\tau_\ell(x) = \tau_\ell \odot f_\ell\!\left(\tau_\ell^{-1} \odot x\right), \qquad (9)$$

where the $\odot$ operation on the left of a matrix multiplies its columns by the coordinate values of the vector, while the $\odot$ operation on the right of a matrix multiplies its rows by the coordinate values of the vector. By transposing Eq. (8), we obtain

$$V_\ell^T = \tau_\ell \odot W_\ell^T \odot \tau_{\ell-1}^{-1}. \qquad (10)$$
Figure 16: Mean absolute difference (± std-dev) between the back-propagated gradient magnitudes of teleported networks and their original, non-teleported network. Larger CoB-ranges generate larger gradients.

By the chain rule and Eq. (9), we can see that

$$\left(f^\tau_\ell\right)'(x) = f'_\ell\!\left(\tau_\ell^{-1} \odot x\right). \qquad (11)$$

Also, from the proof of Theorem 4.9 of [1], the pre-activations and activation outputs of the teleported network, which we denote $\tilde a_\ell$ and $\tilde z_\ell$, satisfy

$$\tilde a_\ell = \tau_\ell \odot a_\ell \quad \text{and} \quad \tilde z_\ell = \tau_\ell \odot z_\ell. \qquad (12)$$
Figure 17: Validation plots produced over five runs over three learning rates (0.01, 0.001 and 0.0001) of the four models MLP, VGG, ResNet and DenseNet with normal (top row) and Xavier (bottom row) initializations on CIFAR-100.
Figure 18: Validation plots produced over five runs over three learning rates (0.01, 0.001 and 0.0001) of the four models MLP, VGG, ResNet and DenseNet with pseudo-teleportation on CIFAR-100.
Theorem 6.1

Let $(V, f^\tau)$ be a teleportation of the neural network $(W, f)$ with respect to the CoB $\tau$. Then

$$\nabla_{V_\ell}\mathcal{L} = \tau_{\ell-1} \odot \nabla_{W_\ell}\mathcal{L} \odot \tau_\ell^{-1}$$

for every layer $\ell$ of the network.

Proof: We proceed by induction on the steps of the backward pass of the network $(V, f^\tau)$, denoting with a tilde the quantities computed on it. By Theorem 4.9 of [1], both networks $(W, f)$ and $(V, f^\tau)$ give the same output, so $\tilde z_L = z_L$ and, by Eq. (5), $\tilde\delta_L = \delta_L$. Moreover, substituting Eq. (12) into Eq. (11), we get at every layer

$$\left(f^\tau_\ell\right)'(\tilde a_\ell) = f'_\ell\!\left(\tau_\ell^{-1} \odot \tau_\ell \odot a_\ell\right) = f'_\ell(a_\ell).$$

We first show that $\tilde\delta_\ell = \tau_\ell^{-1} \odot \delta_\ell$ for every layer $\ell$. At layer $L$, the CoB of the output neurons is given by $1$'s, so

$$\tilde\delta_L = \delta_L = \tau_L^{-1} \odot \delta_L.$$

For the inductive step, we assume the result to be true for layers $\ell, \dots, L$ and prove that it holds for layer $\ell - 1$. Indeed, applying Eq. (6) to the neural network $(V, f^\tau)$ and substituting Eq. (10), we obtain

$$\tilde\delta_{\ell-1} = V_\ell^T\left(\tilde\delta_\ell \odot f'_\ell(a_\ell)\right) = \left(\tau_\ell \odot W_\ell^T \odot \tau_{\ell-1}^{-1}\right)\left(\tau_\ell^{-1} \odot \delta_\ell \odot f'_\ell(a_\ell)\right) = \tau_{\ell-1}^{-1} \odot W_\ell^T\left(\delta_\ell \odot f'_\ell(a_\ell)\right) = \tau_{\ell-1}^{-1} \odot \delta_{\ell-1}.$$

Finally, writing the gradient of $(V, f^\tau)$ analogously to Eq. (7) and using Eq. (12), we get

$$\nabla_{V_\ell}\mathcal{L} = \left(\tilde\delta_\ell \odot f'_\ell(a_\ell)\right) \tilde z_{\ell-1}^T = \left(\tau_\ell^{-1} \odot \delta_\ell \odot f'_\ell(a_\ell)\right)\left(\tau_{\ell-1} \odot z_{\ell-1}\right)^T = \tau_{\ell-1} \odot \nabla_{W_\ell}\mathcal{L} \odot \tau_\ell^{-1}.$$

It can be appreciated that back-propagation on $(V, f^\tau)$ computes a re-scaling of the gradient of $(W, f)$ by the CoB, just as claimed in the statement of the theorem. □

Convergence speed up

Following section 5.2, we present in figure 17 the validation accuracy box plots obtained with normal and Xavier initializations on the CIFAR-100 dataset. We also show in figure 18 more pseudo-teleportation results for MLP, VGG, ResNet and DenseNet on CIFAR-100.