tourbillon-pytorch
Unofficial (and not yet working) implementation of "Tourbillon: a Physically Plausible Neural Architecture"
In a physical neural system, backpropagation is faced with a number of obstacles including: the need for labeled data, the violation of the locality learning principle, the need for symmetric connections, and the lack of modularity. Tourbillon is a new architecture that addresses all these limitations. At its core, it consists of a stack of circular autoencoders followed by an output layer. The circular autoencoders are trained in self-supervised mode by recirculation algorithms and the top layer in supervised mode by stochastic gradient descent, with the option of propagating error information through the entire stack using non-symmetric connections. While the Tourbillon architecture is meant primarily to address physical constraints, and not to improve current engineering applications of deep learning, we demonstrate its viability on standard benchmark datasets including MNIST, Fashion MNIST, and CIFAR10. We show that Tourbillon can achieve comparable performance to models trained with backpropagation and outperform models that are trained with other physically plausible algorithms, such as feedback alignment.
In physical neural systems, whether biological or neuromorphic, backpropagation is faced with a number of problems (e.g. bengio2015towards; nokland2016direct; baldi2016local; guerguiev2017towards; baldi2018learning; lillicrap2020backpropagation). Of greatest relevance here are the following problems:
Labeling: the need for large quantities of labeled data to compute gradients for supervised learning.
Locality: in a physical neural system, the learning rule for adjusting the synaptic weights must be local, i.e. depend only on variables that are available locally, both in space and time, at the synapse.
Symmetry of Connections: backpropagation requires precisely symmetric connections between the forward and backward passes that may be hard to realize in a physical neural system.
Distances: backpropagation requires propagating signals over significant neural distances, which could lead to signal dilution, and distorted or unstable gradients.
Developmental Modularity: backpropagation in general requires having a complete architecture in place before training can begin, which may not be suitable under certain developmental constraints.
A number of solutions have been suggested to address these problems, in isolation or in small combinations, but no single approach addresses all of them well at once. The Tourbillon architecture proposes to address all of them at once by combining different ideas including stacked autoencoders, random backpropagation, and recirculation. We stress that the goal here is to address the problems above in physical neural systems, and not to derive a new architecture or algorithm that is practically useful for current engineering applications of deep learning.
One well-known approach for dealing with the Labeling problem is to use a stack of autoencoders (hinton2006fast; baldi2012autoencoders), where each autoencoder in the stack is trained in a self-supervised manner to reproduce the hidden representation produced by the previous autoencoder in the stack. The intuition is that such a stack can learn increasingly more abstract and powerful representations of the input data without the need for labels. Labels can be used at the top of the architecture to train the top layer in a supervised manner by stochastic gradient descent, with the additional option of backpropagating through all the layers to fine-tune the entire stack (although there is some debate in the field as to whether the latter is helpful (tschannen2018recent)). The stacked-autoencoder approach has the potential for addressing two other problems: Distances and Developmental Modularity. Because backpropagation occurs inside each autoencoder in the stack, error gradients are propagated over shorter distances, confined to the size of the individual autoencoders. Furthermore, individual autoencoders can be trained at least as soon as all the previous autoencoders in the stack are trained and possibly earlier, avoiding the need to wait for the entire architecture to be wired before training can begin.
The stack-of-autoencoders approach, however, does not address the problems of Locality and Symmetry of Connections. Each autoencoder is deep by definition, in the sense that it has at least one hidden layer, and therefore it requires applying the backpropagation algorithm across at least two adaptive layers of weights. This in turn requires having a deep learning channel (i.e. "wires") for transmitting the error signal and making it local to the hidden layer(s). Furthermore, the weights on these wires have to be symmetric to the forward weights in order to implement standard backpropagation.
Feedback alignment (FA) or random backpropagation (RBP) refers to a family of algorithms (lillicrap2016random; baldiRBP2016AI) that address the Symmetry problem by using non-symmetric random weights in the backward pass. For a forward weight $w_{ij}^{h}$ connecting unit $j$ in layer $h-1$ to unit $i$ in layer $h$, the backpropagation learning equation can be written as:
$$\Delta w_{ij}^{h} = \eta\, B_i^{h}\, O_j^{h-1} \tag{1}$$
where $\eta$ denotes the learning rate, $B_i^{h}$ denotes the postsynaptic backpropagated error, and $O_j^{h-1}$ denotes the presynaptic activity. In contrast, RBP can be expressed as:
$$\Delta w_{ij}^{h} = \eta\, R_i^{h}\, O_j^{h-1} \tag{2}$$
where $R_i^{h}$ denotes the postsynaptic, randomly backpropagated, error. Many flavors of RBP can be obtained by using different notions of randomness to select the backward weights, by including various skipped backward connections, by making the backward weights adaptive, and so forth (baldiRBP2016AI; baldiSymmetries2017). While RBP addresses the Symmetry problem, by itself it does not address the other problems. Furthermore, while RBP simulations have succeeded on the MNIST and CIFAR benchmark data sets, it has been noted that RBP algorithms do not work well with convolutional layers (bartunov2018assessing; moskovitz2018feedback; refinetti2020dynamics). A few methods have been proposed to address this apparent weakness of RBP algorithms; however, most of them are not biologically plausible. For instance, liao_backpropagation_2015 proposes to use uniform sign-concordant feedback (uSF), where each forward matrix $W$ in the forward pass is replaced by the matrix $\mathrm{sign}(W^{\top})$ in the backward pass. They also pointed out that normalization methods such as batch normalization and batch Manhattan can improve the performance of RBP when dealing with convolution layers. moskovitz2018feedback modifies the uSF method by putting additional constraints on the backward weight matrices. In all these approaches, the backward weight matrices must have some knowledge about the forward matrices.
In a standard feed forward autoencoder, the data itself provides the targets (self-supervised learning). The data and hence the targets are available in the input layer. However, they are not available in the output layer, in the sense that they are not physically local to the output layer. This problem is addressed by circular autoencoders where the output layer is physically equal (or physically adjacent) to the input layer (Figure 1). With the circular layout, targets and errors can be computed at the level of the input/output layer. Standard backpropagation, or even standard RBP, of these targets would require a channel (wires) running backward from the output layer to the hidden layer. However, because of the circular layout, one may suspect that it may be possible to use the forward connections to propagate target and error information during learning. This is the fundamental idea behind recirculation, a family of algorithms for training circular autoencoders that do not require backward connections (hinton1988learning; o1996biologically; baldi2018learning).
Consider a circular autoencoder with layers numbered from $0$ to $L$, where $0$ and $L$ correspond to the input layer, and use the index $t$ to denote successive feed forward passes through the autoencoder. After the first pass, one can locally compute the error $T - O^{L}$, where $T$ is the target, i.e. the original data in the case of an autoencoder, and $O^{L}$ is the output of that pass. This error could be used to train the top layer of the circular autoencoder by gradient descent, and then train the other layers by using a form of RBP where the error signal is obtained by propagating the error using the forward weights of the circular autoencoder. This however requires propagating two different kinds of signals, activities and errors, through the circular autoencoder. Thus rather than recirculating the error, a more uniform approach can be obtained by recirculating activities. If $O^{h}(t)$ denotes the output of layer $h$ during the forward pass indexed by $t$, the main idea behind the recirculation family of algorithms is to use $O^{h}(t)$ as the target for the output $O^{h}(t')$ taken at a later time $t' > t$. The intuition behind this is that the data may become increasingly corrupted as it is being recycled. Several different algorithms can be obtained by varying, for instance, the choices of $t$ and $t'$ and the manner in which outputs from different passes are combined (e.g. convex combinations) (baldi2018learning). Here we will use the simplest form, which often works the best, based only on the plain activities of the first two passes, corresponding to $t$ and $t+1$. Thus, for any weight $w_{ij}^{h}$ in the circular autoencoder, the corresponding learning-by-recirculation equation becomes:
$$\Delta w_{ij}^{h} = \eta\,\big(O_i^{h}(t) - O_i^{h}(t+1)\big)\,O_j^{h-1}(t+1) \tag{3}$$
This rule has a Hebbian-product form between a post-synaptic term and a pre-synaptic term, and is similar to backpropagation, except that the postsynaptic backpropagated error is replaced by the postsynaptic recirculation error $O_i^{h}(t) - O_i^{h}(t+1)$. This error term is local in space and is assumed to be also local in time. This requires the assumption that two consecutive passes through the circular autoencoder fall within the time window that defines locality in a particular physical system. For the input layer, the vector $O^{L}(t)$ of the first pass corresponds to the input data, hence also to the targets in the case of an autoencoder. Thus the recirculation learning equation for the top layer of weights is exactly equal to backpropagation (stochastic gradient descent). Having stochastic gradient descent applied to the top layer is also essential for RBP to work. The recirculation error $O_i^{h}(t) - O_i^{h}(t+1)$ can also be thought of as a derivative of the activity. Thus, in short, recirculation learning rules rely on the product of the pre-synaptic activity by some measure of change in the post-synaptic activity, which is used to communicate error information. Although in this work we are not using spiking neurons, such learning rules are closely related to the concept of spike time-dependent synaptic plasticity (STDP). STDP Hebbian or anti-Hebbian learning rules have been proposed using the temporal derivative of the activity of the post-synaptic neuron (NIPS1999_1658) to encode error derivatives. In the Appendix, some of the simulations also explore a slightly different anti-Hebbian learning rule:
$$\Delta w_{ij}^{h} = \eta\,\big(O_i^{h}(t) - O_i^{h}(t+1)\big)\,\big(O_j^{h-1}(t) - O_j^{h-1}(t+1)\big) \tag{4}$$
where the pre- and post-synaptic terms are more similar in form. This rule is applied to all the layers except the top adaptive layer, where the standard SGD rule is used.
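As a rough illustration of the two-pass recirculation update described above, the following plain-Python sketch computes the weight changes for a circular autoencoder with a single hinge layer. It is not the paper's or the repository's code. The function names, the list-of-rows weight layout, and the choice of the recirculated input as the pre-synaptic term for the encoder are assumptions made for illustration. With a logistic output, the decoder update coincides with the delta rule (target minus output, times hidden activity).

```python
import math


def layer(weights, inputs, activation):
    """Fully-connected layer: activation(W @ inputs), with W as a list of rows."""
    return [activation(sum(w * v for w, v in zip(row, inputs))) for row in weights]


def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))


def recirculation_deltas(w_enc, w_dec, x0, lr=0.01):
    """Weight changes from two passes around a one-hinge circular autoencoder.

    Assumed convention (not taken from the paper): x0 is the data held by the
    input/output layer, h1 and x1 come from the first pass, h2 from the second.
    """
    h1 = layer(w_enc, x0, math.tanh)   # hinge activity, first pass
    x1 = layer(w_dec, h1, logistic)    # reconstruction, lands on the input layer
    h2 = layer(w_enc, x1, math.tanh)   # hinge activity, second pass

    err_top = [a - b for a, b in zip(x0, x1)]    # data minus reconstruction
    err_hinge = [a - b for a, b in zip(h1, h2)]  # change in hinge activity

    # Post-synaptic recirculation error times pre-synaptic activity.
    d_dec = [[lr * e * a for a in h1] for e in err_top]
    d_enc = [[lr * e * a for a in x1] for e in err_hinge]
    return d_enc, d_dec


# Example call with 3 inputs and 2 hinge units (arbitrary small weights).
w_enc = [[0.1, -0.2, 0.05], [0.0, 0.3, -0.1]]
w_dec = [[0.2, 0.1], [-0.1, 0.0], [0.05, 0.3]]
d_enc, d_dec = recirculation_deltas(w_enc, w_dec, [1.0, 0.0, 0.5])
```

Both updates only use quantities available at the two ends of each connection within two consecutive passes, which is the locality property the text emphasizes.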
We propose the Tourbillon architecture as a stack of circular autoencoders followed by a partially or fully-connected layer between the hidden representation of the top circular autoencoder and the output layer. Each circular autoencoder is broken down into two components, the encoder and the decoder. The hidden layer that is shared by the encoding and decoding components of a circular autoencoder is called the hinge hidden layer. In the stack, the hinge hidden layer for the circular autoencoder at level $k$ becomes the input layer for the next circular autoencoder at level $k+1$ in the stack.
The circular autoencoders are trained by recirculation (self supervised manner) and the top layer can be trained in a supervised manner by stochastic gradient descent, with the option of continuing the error propagation through the entire stack, as a form of RBP using non-symmetric connections. In a physical neural system, this requires that the targets be locally available at the final output layer. This architecture addresses all five problems mentioned in the introduction. Most of the training is unsupervised, all the learning rules are local in space and time, the forward and backward connections are not symmetric, information is propagated only over relatively short distances, the development is modular, and also convolutions can be incorporated into this framework. To prove its viability, we test this architecture on several benchmark problems in the Experiments section.
Although we simulate Tourbillon in its most basic version, there are a number of possible variations on the basic Tourbillon idea and training algorithms. In particular, in the basic version, the stack of autoencoders is trained sequentially bottom-up starting from the initial input layer. It is possible to consider more asynchronous modes, where for instance any circular autoencoder in the stack is being trained as soon as it receives an input from a lower autoencoder.
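The sequential bottom-up schedule of the basic version can be sketched as follows, again in dependency-free Python rather than PyTorch. The `CircularAE` class, `train_stack`, the tanh encoder with logistic decoder, the initialization scale, and the toy data are all hypothetical choices for illustration. Only the overall schedule (train one level, freeze it, feed its hinge activity to the next level) follows the text.

```python
import math
import random


def dense(weights, inputs, activation):
    return [activation(sum(w * v for w, v in zip(row, inputs))) for row in weights]


def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))


class CircularAE:
    """One-hinge circular autoencoder (hypothetical helper, not the repo's API)."""

    def __init__(self, n_in, n_hinge, rng):
        self.w_enc = [[rng.gauss(0, 0.1) for _ in range(n_in)] for _ in range(n_hinge)]
        self.w_dec = [[rng.gauss(0, 0.1) for _ in range(n_hinge)] for _ in range(n_in)]

    def encode(self, x):
        return dense(self.w_enc, x, math.tanh)

    def decode(self, h):
        return dense(self.w_dec, h, logistic)

    def recirculate(self, x0, lr):
        """One in-place recirculation update using two passes."""
        h1 = self.encode(x0)
        x1 = self.decode(h1)
        h2 = self.encode(x1)
        for i, row in enumerate(self.w_dec):
            for j in range(len(row)):
                row[j] += lr * (x0[i] - x1[i]) * h1[j]
        for i, row in enumerate(self.w_enc):
            for j in range(len(row)):
                row[j] += lr * (h1[i] - h2[i]) * x1[j]


def train_stack(data, sizes, epochs=5, lr=0.05, seed=0):
    """Train circular autoencoders one level at a time, bottom-up."""
    rng = random.Random(seed)
    stack, inputs = [], data
    for n_in, n_hinge in zip(sizes[:-1], sizes[1:]):
        ae = CircularAE(n_in, n_hinge, rng)
        for _ in range(epochs):
            for x in inputs:
                ae.recirculate(x, lr)
        stack.append(ae)
        # The trained hinge activity becomes the input of the next level.
        inputs = [ae.encode(x) for x in inputs]
    return stack, inputs


toy_data = [[1.0, 0.0, 1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0, 0.0, 1.0]]
stack, top_hinge = train_stack(toy_data, sizes=[6, 4, 2], epochs=3)
```

An asynchronous variant, as mentioned in the text, would interleave the inner loops so that an upper level starts updating as soon as the level below produces hinge activity. That variant is not shown here.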
Once the stack of circular autoencoders is in place, possibly after a developmental phase, the connections of their decoder components provide a deep learning channel, i.e. a physical channel through which error obtained at the top of the architecture could be communicated to the lower layers. This channel, whose weights are initially learned by recirculation, can be used to fine-tune the entire Tourbillon architecture, using a form of random backpropagation, since there is no imposed symmetry between the weights of the encoder and decoder networks of each circular autoencoder. This produces a kind of RBP error signal for the hinge hidden layer of each circular autoencoder. If the encoder component of circular autoencoders has multiple layers, different possibilities can be considered for the fine-tuning phase when the error signal arrives at the hinge. In all the simulations, we use circular autoencoders with a single hidden layer, and thus the RBP error that arrives at the hinge is local to the corresponding encoder weights and these can be fine-tuned by standard RBP (Equation 2).
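One way to picture the fine-tuning channel is the sketch below, which sends an error signal from one hinge layer down to the hinge below it through the decoder weights instead of the transposed encoder weights. The function names and the inclusion of the tanh derivative are assumptions. The text only states that the decoder connections carry the error and that encoder weights are then adjusted by RBP.

```python
def propagate_error_through_decoder(err_upper, w_dec, h_lower):
    """Send an error from hinge k down to hinge k-1 via the decoder of level k.

    w_dec has one row per lower unit and one column per upper unit, so it
    plays the role that the transposed encoder matrix would play in standard
    backpropagation. The factor (1 - h^2) is the tanh derivative (assumed).
    """
    back = [sum(w * e for w, e in zip(row, err_upper)) for row in w_dec]
    return [b * (1.0 - h * h) for b, h in zip(back, h_lower)]


def encoder_delta(err_post, act_pre, lr):
    """RBP-style update: post-synaptic error times pre-synaptic activity."""
    return [[lr * e * a for a in act_pre] for e in err_post]


# Toy numbers: 2 upper hinge units, 3 lower hinge units.
err_upper = [0.2, -0.1]
w_dec = [[1.0, 2.0], [0.0, -1.0], [0.5, 0.5]]
h_lower = [0.0, 0.5, -0.5]

err_lower = propagate_error_through_decoder(err_upper, w_dec, h_lower)
delta_w_enc = encoder_delta(err_upper, h_lower, lr=0.1)
```

Because the decoder weights were learned by recirculation and are not tied to the encoder weights, no weight symmetry is needed for this step.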
Due to the modularity of the Tourbillon architecture, it can be composed in many different ways. In particular, it can be utilized to build a physically plausible twin architecture for every physically non-plausible neural network trained with backpropagation or random backpropagation. Although the physically plausible architecture does not outperform its twin, this idea opens the door for further studies of physically plausible approaches. Here we explain the process of building the twin architecture through two examples: (1) Tourbillon U-Net; and (2) Tourbillon recursion. We also provide the general algorithm for building the physically plausible twin architecture.
The first example is the “tourbillonization” of a U-Net (ronneberger2015u) architecture. The U-Net architecture is basically a feed forward autoencoder with a multi-layer encoder and a multi-layer decoder, as well as additional skip connections (which we ignore for simplicity). This architecture was originally developed for image segmentation problems, where the segmented images are used as the targets. Each layer of a U-Net architecture can be replaced by a circular autoencoder to build a Tourbillon version of U-Net. The only non-local aspect of the Tourbillon U-Net is the need for targets in the final output layer. This could be addressed by turning U-Net itself into a circular autoencoder, which leads to the second example, where the Tourbillon approach can be used recursively by replacing each hidden layer of a circular autoencoder with a circular autoencoder, a recursion that in principle could be iterated several times. In the later sections we experiment with the tourbillonization of U-Net; however, how a Tourbillon of Tourbillons architecture can be trained efficiently is left for future work.
In the experiments, we train circular autoencoders and Tourbillon architectures capable of supporting feed forward fully-connected layers, 2D convolutions, 2D max pooling, 2D up sampling, reshaping, and five different non-linear activation functions.
In the experiments, we use three well-known benchmark datasets: (1) MNIST, comprising 70,000 gray-scale images of handwritten digits of size $28 \times 28$, with each pixel normalized to the $[0, 1]$ range. We used the standard 60,000/10,000 train-test split; (2) Fashion MNIST, comprising 70,000 gray-scale images of fashion accessories of size $28 \times 28$, with each pixel normalized to the $[0, 1]$ range. We used the standard 60,000/10,000 train-test split; and (3) CIFAR-10, comprising 60,000 images of size $32 \times 32$ with three RGB channels, corresponding to 10 object classes, with each pixel normalized to the $[0, 1]$ range. We used the standard 50,000/10,000 train-test split. For the following experiments, the models are trained using a single NVidia Titan X GPU.
Since the building blocks of the Tourbillon architecture are circular autoencoders, we train compressive circular autoencoders using the recirculation algorithm (Equation 3) and compare the mean-squared reconstruction loss (similar results are obtained with the relative entropy loss) with that of the corresponding architecture trained with both backpropagation and random backpropagation. For experiments with MNIST and Fashion MNIST, the circular autoencoder consists of two feed forward fully-connected layers, where the middle layer is of size 256. For experiments with CIFAR10, the circular autoencoder consists of two 2D convolution layers that construct the hidden representation in the hinge layer. Both convolutional layers use zero padding to preserve the size of the input. Figure 3 shows the training and test error curves for the MNIST and Fashion MNIST datasets. Figure 8, first row, shows the training and test error curves for the convolutional circular autoencoder trained using the CIFAR10 dataset. In all these experiments, recirculation leads to reconstruction errors that are comparable to, and possibly even better than, the corresponding errors obtained using backpropagation and random backpropagation. In all these experiments, the models are trained with a mini-batch size of 64 to minimize the reconstruction error. Across all three models, all hidden layers use a tanh activation function and the final layer uses a logistic activation function (since the pixels are normalized to $[0, 1]$). During our experiments, we observed that the presence of non-linear activation functions in intermediate layers is essential for learning, especially for the circular autoencoder, which is consistent with the results provided in baldi2016learning. For all models, a grid search was used to optimize the hyperparameters. As a result, models trained with backpropagation and random backpropagation across all three datasets use a starting learning rate of 0.01 for all the layers, with a momentum term of 0.8. On the other hand, for the circular autoencoder trained using the MNIST and Fashion MNIST datasets, the grid search resulted in starting learning rates of 0.01 and 0.001 for the first and second layers respectively. For the circular autoencoder trained using the CIFAR10 dataset, the grid search resulted in starting learning rates of 0.001 and 0.0001 for the first and second convolutional layers respectively. All these learning rates are also decayed during training.
Here we stack trained circular autoencoders to build a Tourbillon architecture. The input data to each of the circular autoencoders correspond to the hidden representations produced in the hinge layer of the previous circular autoencoder in the stack. We perform three classification experiments using MNIST, Fashion MNIST, and CIFAR10 and compare the performance of Tourbillon with two neural networks with an equivalent architecture trained with backpropagation and random backpropagation. As shown in Figure 4 and the second row of Figure 8, Tourbillon outperforms random backpropagation and is comparable to standard backpropagation, especially when using feed forward fully-connected architectures. For MNIST and Fashion MNIST, we use the same architecture, where the Tourbillon consists of two trainable circular autoencoders, the first tasked with compressing the data size from 784 to 256, and the second tasked with compressing the hinge hidden representation from 256 to 64. This is followed by a fully-connected layer with a softmax activation function to perform the final classification. Each of the circular autoencoders is trained by recirculation with the same hyperparameters used in Section 6.2. The top layer is initialized with Glorot initialization (glorot_2010) and, after a hyperparameter grid search, its weights are optimized with a learning rate of 0.01 and a momentum of 0.8. Similarly, for the CIFAR10 dataset, the Tourbillon architecture comprises two trainable circular autoencoders. The two circular autoencoders are similar to the circular autoencoder in Section 6.2 for the CIFAR10 dataset. The encoder component of these two circular autoencoders is provided by a 2D convolutional layer with tanh activation functions and the decoder component consists of another 2D convolutional layer with a logistic activation function. We use the same hyperparameters as we used in Section 6.2 for training a circular autoencoder using the CIFAR10 dataset.
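The supervised top layer can be illustrated with a small plain-Python sketch of one stochastic gradient step with momentum on a softmax classifier that reads the top hinge representation. The learning rate 0.01 and momentum 0.8 are the values quoted above. The function names, the particular momentum formulation (velocity accumulates the scaled gradient step), and the zero initialization in the example are assumptions for illustration, not the paper's implementation.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_layer_step(weights, velocity, hinge, target, lr=0.01, momentum=0.8):
    """One in-place SGD-with-momentum step on the softmax top layer.

    weights and velocity are lists of rows (one row per class); hinge is the
    top hinge representation; target is a one-hot list.
    """
    logits = [sum(w * h for w, h in zip(row, hinge)) for row in weights]
    probs = softmax(logits)
    for i, row in enumerate(weights):
        for j in range(len(row)):
            # Negative cross-entropy gradient for a softmax output: (t - y) * h
            step = lr * (target[i] - probs[i]) * hinge[j]
            velocity[i][j] = momentum * velocity[i][j] + step
            row[j] += velocity[i][j]
    return probs


# Two classes, two hinge units, zero-initialized for an easy-to-follow example.
top_w = [[0.0, 0.0], [0.0, 0.0]]
top_v = [[0.0, 0.0], [0.0, 0.0]]
first_probs = top_layer_step(top_w, top_v, hinge=[1.0, 2.0], target=[1.0, 0.0])
```

In a full Tourbillon, `hinge` would be the output of the last circular autoencoder in the stack (64 units in the MNIST configuration described above), and the weights would have one row per class.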
As described in Section 5.2, following the tourbillonization algorithm, we build the physically plausible twin architecture for a U-Net autoencoder architecture that addresses all the problems mentioned in the Introduction. In addition, we also “tourbillonize” a feed forward fully-connected network architecture. Results obtained on MNIST and Fashion MNIST are very similar, therefore here we report the results obtained on MNIST and CIFAR10. The U-Net architecture for the MNIST dataset comprises a two-layer encoder and a two-layer decoder. The encoder layers compress the data from 784 to 128, and then from 128 to 64. The decoder layers expand the data from 64 to 128, and then from 128 to 784. The U-Net architecture for the CIFAR10 dataset comprises two 2D convolutional layers. For both the MNIST and CIFAR10 U-Nets we use the same mini-batch size, learning rate, decay rate, momentum, and initializations as described in Section 6.2. After building the twin Tourbillon architectures and training their circular autoencoders by recirculation, we apply fine-tuning as described in Section 5.1 to perform random backpropagation using the channel provided by the decoders’ connections. Empirically, we found that it is necessary to have different learning rate schedules for different layers during the fine-tuning phase to ensure a form of asynchronous learning among layers (see also refinetti2020dynamics). A summary of the results comparing the various approaches is given in Table 1 (first row), and shows that the performance of the Tourbillon twin approach is comparable to backpropagation.
For the feed forward architecture, we use a three-layer network with 256, 64, and 10 hidden units. For this architecture and its Tourbillon twin, we use the same hyperparameters as in Section 6.2. During the fine-tuning of the Tourbillon, the learning rate schedules of each layer are similar to the rates used for the U-Net Tourbillon. A summary of the results comparing the various approaches is given in Table 1 (second row), and shows again that the performance of the Tourbillon twin approach is comparable to backpropagation.
Architecture | Method | MNIST Train | MNIST Test | CIFAR10 Train | CIFAR10 Test |
---|---|---|---|---|---|
U-Net | BP | 0.0031 (5.8e-05) | 0.0031 (5.8e-05) | 0.0024 (5.8e-05) | 0.0026 (e-04) |
U-Net | Twin | 0.0111 (e-04) | 0.0113 (2e-04) | 0.0733 (3.2e-04) | 0.0783 (3.2e-04) |
FC | BP | 1.04 (0.20) | 1.18 (0.20) | - | - |
FC | Twin | 5.21 (0.40) | 5.82 (0.40) | - | - |
Table 1: Performance comparison of networks trained with backpropagation vs. their physically plausible Tourbillon twin. For the U-Net experiment, the numbers represent the mean-squared reconstruction error. For the feed forward fully-connected architecture, the numbers represent the classification error. For every reported number, the standard deviation is shown in parentheses. FC stands for fully-connected network.
In a physical neural system, implementation of backpropagation is faced with a number of significant challenges described in the Introduction. Here we have presented a systematic approach we call Tourbillon for deriving physically plausible architectures addressing most if not all of them. At its core, Tourbillon relies on a stack of circular autoencoders trained by recirculation. Tourbillon architectures are modular and can be trained in large part in self-supervised mode using learning rules that are local in both space and time, without the need for weight transport.
The name Tourbillon captures the turbulent topology of the architecture. Moreover, in horology, a tourbillon is an addition to the mechanics of a watch escapement to increase its accuracy. While we do not claim to have increased accuracy, we have shown that the Tourbillon approach achieves results that are comparable to backpropagation, at least on MNIST, Fashion MNIST, and CIFAR10.
In the development of Tourbillon, several issues have been identified that will require additional research. The first one is to study whether local learning algorithms can be developed to improve the fine-tuning phase of Tourbillon architectures where the encoder components of the circular autoencoders have multiple hidden layers. Something along the lines of recirculating the RBP error may be a possibility.
The second one is to develop local learning algorithms for the recursive Tourbillon approach, starting with Tourbillons of Tourbillons. In particular, one would like to know whether the fine-tuning phase is necessary and how to carry it out, perhaps using a fast form of recirculation in the smaller circular autoencoders and a slower form of recirculation in the larger circular autoencoder.
The third one is the issue of convolutions that were only incrementally addressed here by showing that Tourbillon with convolutions does better than RBP, but lags behind BP. When convolutions are used, neurons in a convolution layer must have identical incoming weights, which may not be easy to realize in a physical neural system. The most plausible approach to address the convolution problem in a physical system is to first initialize the weights in similar, but not identical, fashion across the entire convolutional layer, using for instance normal weights with small standard deviation. This ensures that all the weights are similar, without enforcing them to be identical. Second, during training, each convolution neuron in the layer learns independently of the other convolution neurons, without enforcing any kind of rigid weight sharing. What is essential, however, is to ensure that all the convolution neurons in the layer see exactly, or approximately, the same training data in aggregate. This is easily achieved through data augmentation by translating each training image in all possible directions, something that may happen automatically in the real world due to moving objects, or head/eye motions. With this data augmentation, the weights of the convolution neurons remain similar throughout training, since they are trained on the same data, without any exact weight sharing (ott2020learning). This approach ought to be tried in Tourbillon.
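The translation-based augmentation suggested above can be sketched in a few lines of plain Python. The helper names, the zero padding at the borders, and the one-pixel default shift range are assumptions made for illustration. The text only says that each training image is translated in all possible directions.

```python
def translate(image, dy, dx, fill=0.0):
    """Shift a 2D image (list of rows) by (dy, dx), padding with `fill`."""
    height, width = len(image), len(image[0])
    out = [[fill] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            ny, nx = y + dy, x + dx
            if 0 <= ny < height and 0 <= nx < width:
                out[ny][nx] = image[y][x]
    return out


def all_translations(image, max_shift=1):
    """Every shift of the image by up to `max_shift` pixels in each direction."""
    shifts = range(-max_shift, max_shift + 1)
    return [translate(image, dy, dx) for dy in shifts for dx in shifts]


tiny = [[1.0, 2.0], [3.0, 4.0]]
augmented = all_translations(tiny, max_shift=1)
```

Training each convolution neuron on such a shifted set means every receptive-field position sees approximately the same patches in aggregate, which is the condition the text identifies for keeping independently learned weights similar.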
Tourbillon provides a framework for investigating these and other issues and broadens the horizon of ongoing research into physically plausible deep learning.
In this section, we first provide more explanation on the new learning rule introduced in Equation 4. Then we include evaluation plots corresponding to: (1) The performance of the circular autoencoder and Tourbillon models trained using the CIFAR10 dataset and their comparison to models trained using random backpropagation and backpropagation; (2) The representation power of Tourbillon models by visualizing the reconstructed images; and (3) The process of tourbillonization.
Here we rewrite the symmetric learning rule we introduced earlier:
$$\Delta w_{ij}^{h} = \eta\,\big(O_i^{h}(t) - O_i^{h}(t+1)\big)\,\big(O_j^{h-1}(t) - O_j^{h-1}(t+1)\big) \tag{5}$$
This learning rule has symmetric terms for the pre- and post-synaptic activation differences. Although the loss trajectory is smoother than when using the main recirculation rule (Figure 5), the algorithm becomes trapped in a mode-collapse state where the reconstructed images are the mean of the entire dataset (Figure 6). Further studies of this learning rule and its mode collapse are left for future work.
The first row of Figure 8 represents the training and test performance of a circular autoencoder trained by recirculation using the CIFAR10 dataset and its comparison to an autoencoder with similar architecture trained with random backpropagation and backpropagation.
Figure 9 shows the reconstructed images using a stack of two circular autoencoders. We also use the t-SNE (van2008visualizing) dimensionality reduction method in Figure 7 to visualize the output of circular autoencoders used in Section 6.3. These two figures show the representational power of the Tourbillon building blocks.