Synaptic Metaplasticity in Binarized Neural Networks
While deep neural networks have surpassed human performance in multiple situations, they are prone to catastrophic forgetting: upon training a new task, they rapidly forget previously learned ones. Neuroscience studies, based on idealized tasks, suggest that in the brain, synapses overcome this issue by adjusting their plasticity depending on their past history. However, such "metaplastic" behaviour has never been leveraged to mitigate catastrophic forgetting in deep neural networks. In this work, we highlight a connection between metaplasticity models and the training process of binarized neural networks, a low-precision version of deep neural networks. Building on this idea, we propose and demonstrate experimentally, in situations of multitask and stream learning, a training technique that prevents catastrophic forgetting without needing previously presented data, nor formal boundaries between datasets. We support our approach with a theoretical analysis on a tractable task. This work bridges computational neuroscience and deep learning, and presents significant assets for future embedded and neuromorphic systems.
In recent years, deep neural networks have experienced incredible developments, outperforming the state-of-the-art, and sometimes human performance, for tasks ranging from image classification to natural language processing
[Lecun2015]. Nonetheless, these models suffer from catastrophic forgetting [Goodfellow2013, Kirkpatrick2016] when learning new tasks: synaptic weights optimized during former tasks are not protected against further weight updates and are overwritten, causing the accuracy of the neural network on these former tasks to plummet [french1999catastrophic, mcclelland1995there] (see Fig. 1(a)). Balancing between learning new tasks and remembering old ones is sometimes framed as a trade-off between plasticity and rigidity: synaptic weights need to be modified in order to learn, but must also remain stable in order to remember. This issue is particularly critical in embedded environments, where data is processed in real time without the possibility of storing past data. Given the rate of synaptic modifications, most artificial neural networks were found to exhibit exponentially fast forgetting [Fusi2005]. This contrasts strongly with the capability of the brain, whose forgetting process is typically described by a power-law decay [wixted1991form], and which can naturally perform continual learning.

The neuroscience literature provides insights into the underlying mechanisms that enable task retention in the brain. In particular, it was suggested by Fusi et al. [Fusi2005, Benna2016] that memory storage requires, within each synapse, hidden states with multiple degrees of plasticity. For a given synapse, the higher the value of this hidden state, the less likely the synapse is to change: it is said to be consolidated. These hidden variables could account for activity-dependent mechanisms regulated by intercellular signalling molecules in real synapses [abraham1996metaplasticity, Abraham2008]. Because the plasticity of the synapse is itself plastic, this behaviour is named "metaplasticity".
The metaplastic state of a synapse can be viewed as a criterion of importance with respect to the tasks learned so far, and therefore constitutes one possible approach to overcoming catastrophic forgetting.
Until now, the models of metaplasticity have been used for idealized situations in neuroscience studies. However, intriguingly, in the field of deep learning, binarized neural networks [Courbariaux2016]
(or the closely related XNOR-NETs
[rastegari2016xnor]) have a remote connection with the concept of metaplasticity that has so far never been explored. Binarized neural networks are neural networks whose weights and activations are constrained to the values +1 and -1. These networks were developed for performing inference with low computational and memory cost [conti2018xnor, bankman2018always, hirtzlin2019digital] and, surprisingly, can achieve excellent accuracy on multiple vision [rastegari2016xnor, lin2017towards] and signal processing [bogdan] tasks. The training procedure of binarized neural networks involves a real value associated with each synapse, which accumulates the gradients of the loss computed with the binary weights. This real value is said to be "hidden", as during inference we only use its sign to obtain the binary weight. In this work, we interpret the hidden weight in binarized neural networks as a metaplastic variable that can be leveraged to achieve multitask learning. Based on this insight, we develop a learning strategy using binarized neural networks to alleviate catastrophic forgetting under strong biologically-inspired constraints: previously presented data cannot be stored, nor generated, and the loss function is not modified with task-dependent weight penalties.
An important benefit of our synapse-centric approach is that it does not require a formal separation between datasets, which also makes it possible to learn a single task in a more continuous fashion. Traditionally, when new data appears, the network must be retrained on the new data combined with the old: otherwise, it will learn the new data only and forget what it had already learned. Through the example of the progressive learning of datasets, we show that our metaplastic binarized neural network, by contrast, can continue to learn a task when new data becomes available, without revisiting the previously presented data of the dataset. This feature makes our approach particularly attractive for embedded contexts. The spatially and temporally local nature of the consolidation mechanism also makes it highly attractive for hardware implementations, in particular neuromorphic approaches.
Our approach takes a remarkably different direction than the considerable research in deep learning that is now addressing the question of catastrophic forgetting. Many proposals consist in keeping or retrieving information about the data or the model on previous tasks: using data generation [shin2017continual], storing exemplars [rebuffi2017icarl], or preserving the initial model response in some components of the network [li2017learning]. These strategies do not seem connected to how the brain avoids catastrophic forgetting, need a very formal separation of the tasks, and are not well suited to embedded contexts. A solution to the trade-off between plasticity and rigidity more connected to ours is to protect synaptic weights from further changes according to their "importance" for the previous task. For example, elastic weight consolidation [Kirkpatrick2016] uses the diagonal elements of the Fisher information matrix of the model distribution with respect to its parameters to identify synaptic weights qualifying as important for a given task. In another work [Zenke2017], the consolidation strategy consists in computing an importance factor based on a path integral. Finally, [aljundi2018memory] uses the sensitivity of the network with respect to small changes in synaptic weights. In all these techniques, the desired memory effect is enforced by changing the loss function and does not emerge from the synaptic behaviour itself. This aspect requires a very formal separation of the tasks, and makes these models still largely incompatible with the constraints of biology and embedded contexts. The highly non-local nature of the consolidation mechanism also makes it difficult to implement in neuromorphic-type hardware.
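For comparison with these loss-modifying approaches, the elastic weight consolidation penalty can be sketched in a few lines of numpy. This is a minimal illustration; the function name and all numeric values are illustrative, not taken from [Kirkpatrick2016]:

```python
import numpy as np

def ewc_penalty(theta, theta_star, fisher_diag, lam):
    """Quadratic penalty added to the new task's loss in elastic weight
    consolidation: weights with high Fisher information are pulled back
    toward their values at the end of the previous task."""
    return 0.5 * lam * np.sum(fisher_diag * (theta - theta_star) ** 2)

theta_star = np.array([1.0, -2.0, 0.5])   # weights after task A (illustrative)
theta      = np.array([1.0, -1.0, 1.5])   # weights while training task B
fisher     = np.array([10.0, 10.0, 0.1])  # diagonal Fisher estimate
penalty = ewc_penalty(theta, theta_star, fisher, lam=1.0)  # ~ 5.05
```

Note that the penalty depends on task boundaries (through theta_star and fisher), which is precisely the requirement our synapse-centric approach avoids.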
Specifically, the contributions of the present work are the following:
We interpret the hidden real value associated with each weight (or hidden weight) in binarized neural networks as a metaplastic variable, and we propose a new training algorithm for these networks adapted to learning different tasks sequentially (Alg. 1).
We show that our algorithm allows a binarized neural network to learn permuted MNIST tasks sequentially with an accuracy equivalent to elastic weight consolidation, but without any change to the loss function or any explicit computation of a task-specific importance factor. More complex sequences, such as MNIST followed by Fashion-MNIST, can also be learned sequentially, with no degradation of the test accuracy on either task relative to the accuracy reached when training on a single task.
We show that our algorithm makes it possible to learn the Fashion-MNIST and CIFAR-10 datasets by sequentially learning each subset of these datasets, a setting we call stream learning.
We show that our approach has a mathematical justification in the case of a tractable quadratic binary task where the trajectory of hidden weights can be derived explicitly.
The training process of conventional binarized neural networks relies on updating hidden real weights associated with each synapse, using loss gradients computed with the binary weights. The binary weights are the signs of the hidden real weights, and are used in the equations of both the forward and backward passes. By contrast, the hidden weights are updated as a result of the learning rule, which therefore affects the binary weights only when a hidden weight changes sign (the detailed training algorithms are presented in Supplementary Algorithms 1 and 2 of Supplementary Note 1). Hidden weight magnitudes have no impact on inference: two binary weights of a binarized neural network may both be equal to +1, while their corresponding hidden weights differ depending on the history of the training process.
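The hidden-weight mechanics described above can be sketched as follows. This is a minimal numpy illustration with placeholder gradients, not the full training loop of Supplementary Note 1:

```python
import numpy as np

rng = np.random.default_rng(0)
w_hidden = rng.normal(scale=0.1, size=5)  # real-valued hidden weights
w_binary = np.sign(w_hidden)              # binary weights used in forward and backward passes

grad = rng.normal(size=5)  # placeholder for a loss gradient computed with the binary weights
lr = 0.01

w_hidden = w_hidden - lr * grad  # the learning rule updates the hidden weights...
new_binary = np.sign(w_hidden)   # ...and a binary weight changes only on a sign flip
flipped = new_binary != w_binary # magnitude changes without sign flips leave inference unchanged
```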
Described as such, the training process of binarized neural networks is intriguingly similar to that of the metaplastic Hopfield networks of [Fusi2005]: it prescribes binarizing the weights for the computation of the preactivations or "synaptic currents", and updating a metaplastic hidden variable for learning. This comparison suggests that the hidden weights in binarized neural networks could also be used as metaplastic variables. In our work, we show that the hidden weights can serve as a criterion of importance for learning several tasks sequentially with one binarized neural network, which involves a single set of synaptic weights. However, for this purpose, the training procedure of binarized neural networks needs to be adapted. Based on the work of Fusi [Fusi2005], our intuition is that binary weights with high hidden weight values are relevant to the current task and can be consolidated: the learning process should ensure that the greater the hidden value, the more difficult it is for the binary weight to switch back. Without such a blocking mechanism, there can be no long-term memory across tasks, since the number of updates required to learn a given task is heuristically equal to the number of updates required to unlearn it. We therefore introduce the function f_meta
to provide an asymmetry which differentiates between updates toward zero hidden weight and away from zero, for equivalent gradient absolute values (see Fig. 1(b)). The higher the hidden weight, the more difficult it is for the binary weight to switch sign, which is very similar in spirit to the cascade of metaplastic states introduced in [Fusi2005]. The strength of the metaplasticity effect is characterized by the real parameter m of the function f_meta (see Fig. 1(c)), the case m = 0 corresponding to the conventional binarized neural network. The detailed training algorithm is provided in Algorithm 1, and its practical implementation is described in Methods.

We first test the validity of our approach by sequentially learning multiple versions of the MNIST dataset in which the pixels have been permuted, which constitutes a canonical benchmark for continual learning [Goodfellow2013]. We train a binarized neural network with two hidden layers of 4,096 units using Algorithm 1, with several values of the metaplasticity parameter m and 40 epochs per task (see Methods). Fig. 2 shows this process of learning six tasks. The conventional binarized neural network (m = 0) is subject to catastrophic forgetting: after learning a given task, the test accuracy quickly drops upon learning a new task. Increasing the parameter m gradually prevents the test accuracy on previous tasks from decreasing, with the most metaplastic binarized neural network (Fig. 2(d)) eventually managing to learn all six tasks with test accuracies comparable to the test accuracy achieved by the BNN trained on one task only (see Table 1).
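A minimal numpy sketch of the asymmetric rule follows. The f_meta form 1 - tanh²(m·x) and the attenuation condition used here are our assumptions, simplified from Algorithm 1:

```python
import numpy as np

def f_meta(m, x):
    # metaplasticity function (assumed form); m = 0 recovers the conventional update
    return 1.0 - np.tanh(m * x) ** 2

def metaplastic_update(w_hidden, grad, lr, m):
    """Sketch of the asymmetric rule: updates pushing a hidden weight further
    from zero are attenuated by f_meta, updates toward zero are applied fully."""
    step = -lr * grad
    away = np.sign(step) == np.sign(w_hidden)   # this step would grow |w_hidden|
    factor = np.where(away, f_meta(m, np.abs(w_hidden)), 1.0)
    return w_hidden + factor * step
```

With m = 0, f_meta is identically 1 and the rule reduces to plain gradient descent on the hidden weights.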
Figs. 2(g) and 2(h) show the distribution of the metaplastic hidden weights in the second layer after learning Task 1 and Task 2. The consolidated weights of the first task in Fig. 2(g) correspond to hidden weights between zero and five in magnitude. We also observe in Fig. 2(g) that a substantial fraction of binary weights still have hidden weights near zero after learning one task. These weights correspond to synapses that repeatedly switched between +1 and -1 during the training of the first task, and are thus of little importance for it. These synapses were therefore not consolidated, and remain available for learning another task, as shown in Fig. 2(h). After learning the second task (Fig. 2(h)), we can distinguish between hidden weights of synapses consolidated for Task 1 and for Task 2.
| | No consolidation (m = 0) | Random consolidation | Elastic weight consolidation | Metaplasticity |
|---|---|---|---|---|
| Task 1 | | | | |
| Task 2 | | | | |
| Task 3 | | | | |
| Task 4 | | | | |
| Task 5 | | | | |
| Task 6 | | | | |
Table 1 presents a comparison of the results obtained using our technique with a random consolidation of weights, and with elastic weight consolidation [Kirkpatrick2016], implemented on the same binarized neural network architecture (see Methods). We see that the random consolidation approach does not allow multitask learning. On the other hand, our approach achieves a performance similar to elastic weight consolidation for learning six permuted MNISTs with the given architecture, although unlike elastic weight consolidation the consolidation is based on an entirely local rule without changing the loss function.
Supplementary Figure 1 shows a more detailed analysis of the performance of our approach when learning up to ten MNIST permutations, and for varying sizes of the binarized neural network, highlighting the connection between network size and its capacity in terms of number of tasks.
As a control experiment, we also applied Algorithm 1 to a full-precision network, except for the weight binarization step described in line one. Figs. 2(e) and 2(f) show the final accuracy on each task at the end of training for a binarized neural network and for a deep neural network with real-valued weights, respectively, with the same architecture. Across the same range of m values, the full-precision network cannot retain more than three tasks with high accuracy. This highlights that our weight consolidation strategy is tied specifically to the use of a binarized neural network.
This experimental result points out the fundamentally different meanings of hidden weights in a binarized neural network and of real weights in a full-precision neural network. In full-precision networks, inference is carried out using the real weights; in particular, the loss function is computed using these weights. Conversely, in binarized neural networks, inference is done with the binary weights, and the loss function is also evaluated with these binary weights, which has two major consequences. First, the hidden weights do not undergo the same updates as the weights of a full-precision network. Second, a synapse whose hidden weight is positive, and which receives a positive update, will not affect the loss, nor its gradient at the next learning iteration, since the loss takes into account only the sign of the hidden weights. Hidden weights in binarized neural networks consequently have a natural tendency to spread over time (Fig. 2(g,h)): they are not technically weights, but a trace of the history of the network updates that is relevant for memory effects.
To further test the ability of our binarized neural network to learn several tasks sequentially, we train it on two tasks in a more difficult situation. When learning permuted versions of MNIST, the relevant input features do not overlap extensively between tasks, which makes sequential learning easier for the network. For this reason, we now train a binarized neural network with two hidden layers of 4,096 units to learn sequentially the MNIST dataset and the Fashion-MNIST dataset [xiao2017fashion], which consists of images of fashion items belonging to ten classes. Fig. 3(b) shows the result of training a binarized neural network for 50 epochs on MNIST and then 50 epochs on Fashion-MNIST (Fig. 3(d) shows the reverse training order). Figs. 3(a) and (c) also show the results for the conventional binarized neural network (m = 0). Baselines are the accuracies the binarized neural network would have obtained had it been trained on each of these tasks separately. In the context of Fig. 3, we take the baseline of Fashion-MNIST to be realized in Fig. 3(a) (orange curve after 100 epochs), and the baseline of MNIST in Fig. 3(c) (blue curve after 100 epochs). We observe that the metaplastic binarized neural network is able to learn both tasks sequentially at baseline accuracies, regardless of the order in which the tasks are learned.
We have shown that the hidden weights of binarized neural networks can readily be used as importance factors for synaptic consolidation. Therefore, in our approach, it is not required to compute an explicit importance factor for each synaptic weight. Our consolidation strategy is carried out simultaneously with the weight update, and locally in space as consolidation only involves the hidden weights. The absence of formal dataset boundaries in our approach is important to tackle another aspect of catastrophic forgetting where all the training data of a given task is not available at the same time. In this section, we use our method to address this situation, which we call “stream learning”: the network learns one task but can only access one subset of the full dataset at a given time. Subsets of the full dataset are learned sequentially and the data of previous subsets cannot be accessed in the future.
We first consider the Fashion-MNIST dataset, split into 60 subsets presented sequentially during training (see Methods). The learning curves for regular and metaplastic binarized neural networks are shown in Fig. 4(a), the dashed lines corresponding to the accuracy reached by the same architecture trained to full convergence on the full dataset. We observe that the metaplastic binarized neural network trained sequentially on subsets of data performs as well as the non-metaplastic binarized neural network trained on the full dataset. The difference in accuracy between the baselines can be explained by our consolidation strategy gradually reducing the number of weights able to switch, thereby acting as a learning rate decay (a non-metaplastic binarized neural network trained on all the data with a learning rate decay reaches a mean accuracy equivalent to the metaplastic baseline in Fig. 4(a)).
To see whether the advantage provided by metaplastic synapses holds for convolutional networks and harder tasks, we then consider the CIFAR-10 dataset, with a binarized version of a Visual Geometry Group (VGG) convolutional neural network (see Methods). CIFAR-10 is split into 20 subsets of 2,500 examples. The test accuracy curve of the metaplastic binarized neural network exhibits a smaller gap to the baseline accuracy than that of the non-metaplastic network. Our metaplastic binarized neural network can thus gain new knowledge from new data without forgetting what it learned from previously seen, now unavailable data.
Because our consolidation strategy does not involve changing the loss function and the batch normalization settings are common across all subsets of data, the metaplastic binarized neural network gains new knowledge with each subset of data without any information about subsets boundaries. This feature is especially useful for embedded applications, and is not currently possible in alternative approaches of the literature to address catastrophic forgetting.
We now provide a mathematical interpretation for the hidden weights of binarized neural networks: we show in archetypal situations that the larger a hidden weight grows while learning a given task, the bigger the loss increase upon flipping the sign of the associated binary weight, and consequently the more important this weight is with respect to the task. For this purpose, we define a quadratic binary task, an analytically tractable and convex counterpart of a binarized neural network optimization task. This task, defined formally in Supplementary Note 3, consists in finding the global optimum on a landscape featuring a uniform (Hessian) curvature. The gradient used for the optimization is evaluated using only the sign of the parameters (Fig. 5(a)), in the same way that binarized neural networks employ only the sign of hidden weights for computing gradients during training. In Supplementary Note 3, we demonstrate theoretically that throughout optimization on the quadratic binary task, if the uniform norm of the weight optimum vector is greater than one, the hidden weight vector diverges. Fig. 5(a) shows an example in two dimensions where such a divergence is seen. This situation is reminiscent of the training of binarized neural networks on practical tasks, where the divergence of some hidden weights is observed. In the particular case of a diagonal Hessian curvature, a correspondence exists between diverging hidden weights and components of the weight optimum greater than one in absolute value. We can derive an explicit form for the asymptotic evolution of the diverging hidden weights during optimization: they diverge linearly, with a speed proportional to the curvature and to the absolute magnitude of the global optimum (see Supplementary Note 3). Given this result, we can prove the following theorem (see Supplementary Note 3).

Theorem 1. Let $w^h(t)$ optimize the quadratic binary task with optimum weight $w^*$ and curvature matrix $H$, using the optimization scheme $w^h(t+1) = w^h(t) - \eta H \left( \mathrm{sign}(w^h(t)) - w^* \right)$. We assume $H$ equal to $\mathrm{diag}(H_{11}, \ldots, H_{nn})$ with $H_{ii} > 0$. Then, if $|w^*_i| > 1$, the variation of loss resulting from flipping the sign of $\mathrm{sign}(w^h_i(t))$ is:
$$\Delta \mathcal{L}_i(t) \underset{t \to \infty}{\sim} 2 H_{ii} + \frac{2}{\eta t} \left| w^h_i(t) \right| \qquad (1)$$
This theorem states that the increase in the loss induced by flipping the sign of a diverging hidden weight is asymptotically the sum of the curvature and a term proportional to the hidden weight, hence the correlation between high-valued hidden weights and important binary weights.
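The linear divergence of a hidden weight when the optimum exceeds one in magnitude can be checked numerically in one dimension. This sketch assumes the optimization scheme w^h ← w^h - η·H·(sign(w^h) - w*), in which the gradient is evaluated at the sign of the parameter:

```python
import numpy as np

eta, H, w_star = 0.01, 2.0, 1.5   # learning rate, curvature, and an optimum with |w*| > 1
w_h = 0.1                         # hidden weight, initially near zero
traj = []
for _ in range(1000):
    grad = H * (np.sign(w_h) - w_star)  # gradient evaluated at the binary weight sign(w_h)
    w_h = w_h - eta * grad
    traj.append(w_h)

speed = traj[-1] - traj[-2]  # per-step drift once the sign has settled
```

Because |w*| > 1, the gradient keeps the same sign forever, and the hidden weight grows linearly at a speed η·H·(|w*| - 1) per step.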
Interestingly, this interpretation, established rigorously in the case of a diagonal Hessian curvature, may generalize to non-diagonal Hessian cases. Fig. 5, for example, illustrates the correspondence between high hidden weights and a high impact on the loss upon sign change, on a quadratic binary task (Fig. 5(b)) with a 500-dimensional non-diagonal Hessian matrix (see Methods for the generation procedure). Fig. 5(c,d,e) finally shows that this correspondence extends to a practical binarized neural network trained on MNIST. In this case, the cost variation upon switching binary weight signs increases monotonically with the magnitude of the hidden weights (see Methods for implementation details). These results explain why hidden weights can be regarded as local importance factors useful for continual learning applications.
Addressing catastrophic forgetting with ideas from both neuroscience and machine learning has led us to an artificial neural network with richer synaptic behaviour that can perform continual learning without requiring an overhead computation of task-related importance factors. The continual learning capability of metaplastic binarized neural networks emerges from their intrinsic design, in stark contrast with other consolidation strategies
[Kirkpatrick2016, Zenke2017, aljundi2018memory]. The resulting model is more autonomous, because the optimized loss function is the same across all tasks. Metaplastic synapses enable binarized neural networks to learn several tasks sequentially, similarly to related works; more importantly, our approach takes first steps beyond a more fundamental limitation of deep learning, namely the need for a full dataset to learn a given task. A single autonomous model able to learn a task from small amounts of data while still gaining knowledge, approaching to some extent the way the brain acquires new information, paves the way for widespread use of embedded hardware for which it is impossible to store large datasets.

Additionally, taking inspiration from the metaplastic behaviour of actual synapses of the brain resulted in a strategy where consolidation is local in space and time. This makes the approach particularly suited to dedicated artificial intelligence hardware and neuromorphic computing, which can save considerable energy by employing circuit architectures optimized for the topology of neural network models, thereby limiting data movement
[editorial_big_2018]. The fact that our approach builds on synapses with rich behaviour also resonates with the progress of nanotechnologies, which can provide compact and energy-efficient electronic devices able to mimic neuroscience-inspired models [ambrogio2018equivalent, boyn2017learning, romera2018vowel, torrejon2017neuromorphic]. This also evidences the benefit of taking inspiration from biology compared with purely mathematically-motivated approaches: biologically-inspired models tend to be naturally compatible with the constraints of hardware development, and can be amenable to the development of energy-efficient artificial intelligence.

In conclusion, we have shown that the hidden weights involved in the training of binarized neural networks are excellent candidates for metaplastic variables that can be efficiently leveraged for continual learning. We have implemented long-term memory into binarized neural networks by modifying the hidden weight update of synapses. Our work highlights that binarized neural networks may be more than a low-precision version of deep neural networks, as well as the potential benefits of the synergy between neuroscience and machine learning research, which for instance aims to convey long-term memory to artificial neural networks. We have also mathematically justified our technique on a tractable quadratic binary problem. Our method allows online synaptic consolidation directly from model behaviour, which is important for dedicated neuromorphic hardware, and is also useful for a variety of settings subject to catastrophic forgetting.
This work was supported by European Research Council Starting Grant NANOINFER (reference: 715872). The authors would like to thank L. Herrera-Diez, J. Thiele, G. Hocquet, P. Bessière, T. Dalgaty and J. Grollier for discussion and invaluable feedback on the manuscript.
AL developed the Pytorch code used in this project and performed all subsequent simulations. AL and ME carried out the mathematical analysis of the Mathematical Interpretation section. TH provided the initial idea for the project and an initial Numpy version of the code. Authors ME and TH contributed equally to the project. DQ directed the work. All authors participated in data analysis, discussed the results and co-edited the manuscript.
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
The binarized neural networks studied in this work are designed and trained following the principles introduced in [Courbariaux2016]
(specific implementation details are provided in Supplementary Note 2). These networks consist of binarized layers in which both weight values and neuron activations assume the binary values +1 and -1. Binarized neural networks can achieve high accuracy on vision tasks [rastegari2016xnor, lin2017towards], provided that the number of neurons is increased with regard to real-valued neural networks. Binarized neural networks are especially promising for AI hardware because, unlike conventional deep networks which rely on costly matrix-vector multiplications, these operations in binarized neural networks can be performed with XNOR logic gates and pop-count operations, reducing power consumption by several orders of magnitude [hirtzlin2019digital].

In this work, we propose an adaptation of the conventional binarized neural network training technique to provide binarized neural networks with metaplastic synapses. We introduce the function f_meta to provide an asymmetry, at equivalent gradient value and for a given weight, between updates toward zero hidden value and away from zero. Alg. 1 describes our optimization update rule; the unmodified update rule is recovered for m = 0, owing to condition (2) satisfied by f_meta. f_meta is defined such that:
$$\forall x, \quad f_{\mathrm{meta}}(0, x) = 1 \qquad (2)$$
$$\forall m, \quad f_{\mathrm{meta}}(m, 0) = 1 \qquad (3)$$
$$\forall m, \quad x \mapsto f_{\mathrm{meta}}(m, x) \ \text{is continuous} \qquad (4)$$
$$\forall m > 0, \quad f_{\mathrm{meta}}(m, x) \ \text{is decreasing in} \ |x|, \ \text{with} \ f_{\mathrm{meta}}(m, x) \to 0 \ \text{as} \ |x| \to \infty \qquad (5)$$
Conditions (3) and (4) ensure that, for near-zero hidden values, the weights are free to switch in order to learn. Condition (5) ensures that the farther from zero a hidden value is, the more difficult it is to make the corresponding binary weight switch back. In all the experiments of this paper, we use:

$$f_{\mathrm{meta}}(m, x) = 1 - \tanh^2(m \cdot x) \qquad (6)$$
The parameter m controls how fast binary weights are consolidated (Fig. 1(c)). This specific choice of f_meta is made to obtain a variety of plasticities over large ranges of time steps (iteration steps), with an exponential dependence as in [Fusi2005]. Specific values of the hyperparameters can be found in Supplementary Note 2.
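The conditions on f_meta can be checked numerically, assuming the 1 - tanh²(m·x) form (a quick numpy verification; the m value 1.3 is illustrative):

```python
import numpy as np

def f_meta(m, x):
    # assumed form of the metaplasticity function
    return 1.0 - np.tanh(m * x) ** 2

x = np.linspace(0.0, 5.0, 101)
ok_cond2 = np.allclose(f_meta(0.0, x), 1.0)   # m = 0: conventional update recovered
ok_cond34 = f_meta(1.3, 0.0) == 1.0           # near zero, weights remain free to switch
ok_cond5 = f_meta(1.3, 0.5) > f_meta(1.3, 1.0) > f_meta(1.3, 2.0)  # consolidation grows with |x|
```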
A permuted version of the MNIST dataset consists of a fixed spatial permutation of the pixels, applied to every example of the dataset. For comparison, we also train a full-precision (32-bit floating point) version of our network with the same architecture, with a real-valued activation function in place of the sign function. The parameters learned in batch normalization are not binary and therefore cannot be consolidated by our metaplastic strategy. Therefore, in our experiments, the binarized and full-precision neural networks have task-specific batch normalization parameters, in order to isolate the effect of weight consolidation on the test accuracies of previous tasks.
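Generating a permuted-MNIST task amounts to drawing one fixed pixel permutation per task (a numpy sketch; the seed and the random batch are illustrative):

```python
import numpy as np

rng = np.random.default_rng(seed=0)
perm = rng.permutation(28 * 28)   # one fixed pixel permutation defines one task

def permute_task(flat_images, perm):
    """Apply the task's fixed spatial permutation to a batch of flattened images."""
    return flat_images[:, perm]

batch = rng.random((4, 28 * 28))      # placeholder images
permuted = permute_task(batch, perm)  # pixel values unchanged, positions shuffled
```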
The elastic weight consolidation control is trained with a fixed regularization parameter (see Supplementary Note 2 for hyperparameter values). The random consolidation presented in Table 1 consists in computing the same importance factors as elastic weight consolidation, and then randomly shuffling the importance factors across synapses.
For Fashion-MNIST experiments, we use a metaplastic binarized neural network of two 1,024 units hidden layers. The dataset is split into 60 subsets of 1,000 examples each, and each subset is learned for 20 epochs. (All classes are represented in each subset.)
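The subset split can be sketched as follows. This is a uniform random split; with subsets of 1,000 examples, all ten classes appear in each subset with high probability, matching the property stated above:

```python
import numpy as np

def stream_subsets(n_examples, n_subsets, seed=0):
    """Split example indices into disjoint subsets presented one after the
    other; data from earlier subsets is never revisited (stream setting)."""
    idx = np.random.default_rng(seed).permutation(n_examples)
    return np.array_split(idx, n_subsets)

subsets = stream_subsets(60000, 60)   # e.g. Fashion-MNIST: 60 subsets of 1,000 examples
```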
For CIFAR-10 experiments, we use a binary version of VGG-7 similarly to [Courbariaux2016]
, with six convolution layers of 128-128-256-256-512-512 filters and kernel sizes of 3. Dropout is used in the last two fully connected layers of 2,048 units. Data augmentation is used within each subset, with random crop and random rotation.

Two major differences between the quadratic binary task and the binarized neural network are the dependence on the training data and the relative contribution of each parameter, which is lower for the BNN than for the quadratic binary task. The procedure for generating Fig. 5(c,d,e) has to be adapted accordingly. Bins of increasing normalised hidden weights are created, but instead of computing the cost variation for a single sign switch, a fixed number of weights is switched within each bin, so as to increase the contribution of the sign switches to the cost variation. The resulting cost variation is then normalised by the number of switched weights. An average is taken over several realizations of the hidden weights to be switched. Given the different sizes of the three layers, the numbers of switched weights per bin for each layer are respectively 1,000, 2,000, and 100.
To generate random symmetric positive definite matrices H, we first generate a diagonal matrix D of eigenvalues drawn from a uniform or normal distribution of given mean and variance, and ensure that all eigenvalues are positive. We then use the subgroup algorithm described in [diaconis1987subgroup] to generate a random rotation R of the appropriate dimension, and finally compute H = R D R^T.

Throughout this work, all simulations are performed using PyTorch 1.1.0. The source code used in this work is freely available online in the SynapticMetaplasticityBNN GitHub repository.
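The matrix construction can be sketched as follows (a NumPy sketch; for simplicity the random rotation is obtained here by QR decomposition of a Gaussian matrix rather than by the subgroup algorithm of [diaconis1987subgroup]):

```python
import numpy as np

def random_spd(n, seed=0):
    """Random symmetric positive definite matrix H = R D R^T:
    D holds positive eigenvalues, R is a random rotation (here from the
    QR decomposition of a Gaussian matrix)."""
    rng = np.random.RandomState(seed)
    eigvals = np.abs(rng.normal(loc=1.0, scale=0.3, size=n))  # force > 0
    d = np.diag(eigvals)
    q, r = np.linalg.qr(rng.normal(size=(n, n)))
    q *= np.sign(np.diag(r))   # fix column signs so the rotation is unbiased
    return q @ d @ q.T

h_mat = random_spd(5)
```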
All the datasets used (MNIST, Fashion-MNIST, CIFAR-10) are available in the public domain.
The optimization is performed with the Adaptive Moment Estimation (Adam) algorithm [Kingma2014]. As the sign function is not differentiable at zero and has zero derivative everywhere else, during error backpropagation the derivative of the hardtanh function is used as a surrogate for the derivative of the sign function. The activation function is the sign function, except for the output layer, and the input neurons are not binarized. We use batch normalization [Ioffe2015] at all layers, as detailed in Alg. 1. Because the sign function is invariant to the multiplication of its input by any positive constant, for a layer with batch normalization statistics μ and σ and learned parameters γ and β we have sign(γ(a − μ)/σ + β) = sign(γ) · sign(a − [μ − σβ/γ]). The only task-dependent parameters that need to be stored on an inference hardware chip are therefore the term between square brackets, along with the sign of γ. The number of task-dependent parameters thus scales with the number of neurons, which is orders of magnitude smaller than the number of synapses.
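The sign activation with its straight-through surrogate derivative can be sketched as follows (a NumPy sketch of the forward and backward rules; in practice this is implemented as a custom autograd function):

```python
import numpy as np

def sign_forward(x):
    """Binarized activation used in the forward pass."""
    return np.where(x >= 0, 1.0, -1.0)

def sign_backward(x, grad_out):
    """Backward pass with the straight-through estimator: the derivative
    of sign is replaced by the derivative of hardtanh, which is 1 on
    [-1, 1] and 0 outside."""
    return grad_out * (np.abs(x) <= 1.0)

x = np.array([-2.0, -0.5, 0.3, 1.7])
y = sign_forward(x)                     # binarized activations
g = sign_backward(x, np.ones_like(x))   # surrogate gradient
```

The surrogate gradient blocks updates for pre-activations outside [-1, 1], which stabilizes training while keeping the forward pass fully binary.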
The Adam optimizer updates the hidden weights with loss gradients computed using the binary weights only. We use a small weight decay in the Adam optimizer to make near-zero hidden weight values more stable. However, consolidated weights are not subject to weight decay: we implement weight decay as a modification of the loss gradient, and this modification is gradually suppressed as a weight becomes consolidated.
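A single metaplastic update step in the spirit described here can be sketched as follows (a NumPy sketch; the attenuation function f_meta(x) = 1 - tanh(m*x)^2 and the variable names are assumptions of this sketch):

```python
import numpy as np

def metaplastic_update(w_hidden, grad, lr, m):
    """One metaplastic SGD step on the hidden (real-valued) weights.
    Updates that push a hidden weight toward zero, i.e. that try to flip
    a consolidated sign, are attenuated by f_meta = 1 - tanh(m*|w|)**2,
    so large hidden weights become increasingly hard to switch.
    Updates that grow |w_hidden| are applied in full."""
    toward_zero = w_hidden * grad > 0        # the step -lr*grad shrinks |w_hidden|
    f_meta = 1.0 - np.tanh(m * np.abs(w_hidden)) ** 2
    scale = np.where(toward_zero, f_meta, 1.0)
    return w_hidden - lr * scale * grad

w = np.array([2.0, -2.0, 0.1])
g = np.array([1.0, 1.0, 1.0])    # this gradient pushes every weight downward
w_new = metaplastic_update(w, g, lr=0.1, m=1.5)
```

Here the strongly consolidated first component barely moves toward zero, the weakly consolidated third component moves almost freely, and the second component, whose magnitude grows, receives the full update.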
pMNISTs
Network | Binarized meta | Binarized EWC | Full precision
Layers | 784-4096-4096-10 | 784-4096-4096-10 | 784-4096-4096-10
Learning rate | 0.005 | 0.005 | 0.005
Minibatch size | 100 | 100 | 100
Epochs/task | 40 | 40 | 40
Metaplasticity m | 1.5 | 0.0 | 1.5
EWC λ | 0.0 | 5,000 | 0.0
Weight decay | 1e-7 | 1e-7 | 1e-7
Initialization | Uniform width = 0.1 | Uniform width = 0.1 | Uniform width = 0.1
FMNIST - MNIST
Network | Binarized meta
Layers | 784-4096-4096-10
Learning rate | 0.005
Minibatch size | 100
Epochs/task | 50
Metaplasticity m | 1.5
Weight decay | 1e-8
Initialization | Uniform width = 0.1
Network | Stream FMNIST | Stream CIFAR-10
Architecture | Binarized meta | Binarized meta
Layers | 784-1024-1024-10 | VGG-7
Number of subsets | 60 | 20
Learning rate | 0.005 | 0.0001
Minibatch size | 100 | 64
Epochs/subset | 20 | 200
Metaplasticity m | 2.5 | 13.0
Weight decay | 1e-7 | 0.0
Initialization | Uniform width = 0.1 | Gauss width = 0.007
The batch normalization parameters were not learned for the Fashion-MNIST experiment, whereas they were learned for the CIFAR-10 experiment. For Fashion-MNIST, the batch normalization parameters are kept at their default values. The performance of the BNN with learned batch normalization parameters was inferior in this setting, as the batch normalization parameters appear to overfit the subsets of data; in the CIFAR-10 experiment, by contrast, the performance was higher with learned batch normalization parameters. The VGG-7 network consists of six convolutional layers with 3×3 kernels, the number of kernels per layer following the sequence 128-128-256-256-512-512. The classifier consists of two hidden layers of 2,048 and 1,024 units, with dropout.
Consider the loss function:

L(w) = ½ (w − w*)ᵀ H (w − w*),    (7)

with H a symmetric positive definite matrix and w* the real-valued optimum. Gradients are given by ∇L(w) = H(w − w*). We assume the following optimization scheme, in which gradients are evaluated at the binarized weights:

w_{t+1} = w_t − η H (sign(w_t) − w*),    (8)

where sign returns the sign of a vector component-wise and η > 0 is the learning rate.
Lemma 1. Let w_t optimize a quadratic binary task according to the dynamics of Eq. (8). Let B = {x : ‖x‖∞ < 1} be the open unit ball for the infinite norm and B̄ its closure. Then:

w* ∈ B  ⟹  sup_t ‖w_t‖∞ < +∞,    (9)
w* ∉ B̄  ⟹  ‖w_t‖∞ → +∞ as t → +∞.    (10)
We first prove Eq. (10). Let us assume that w* ∉ B̄, so that there exists at least one component i such that |w*_i| > 1. Since H is symmetric positive definite, it is invertible. Taking the Euclidean scalar product between H⁻¹e_i and the update of Eq. (8) yields:

⟨H⁻¹e_i, w_{t+1}⟩ = ⟨H⁻¹e_i, w_t⟩ − η ⟨H⁻¹e_i, H(sign(w_t) − w*)⟩ = ⟨H⁻¹e_i, w_t⟩ − η (sign(w_{i,t}) − w*_i),

where we have used at the last equality that H⁻¹ is also symmetric. Since |w*_i| > 1, the sign of sign(w_{i,t}) − w*_i is constant (and nonzero), so the component of w_t along H⁻¹e_i is expected to diverge. More precisely, let us assume w*_i > 1, so that sign(w_{i,t}) − w*_i ≤ 1 − w*_i < 0 and:

⟨H⁻¹e_i, w_{t+1}⟩ ≥ ⟨H⁻¹e_i, w_t⟩ + η (w*_i − 1).    (11)

Summing Eq. (11) from time step 0 to t yields:

⟨H⁻¹e_i, w_t⟩ ≥ ⟨H⁻¹e_i, w_0⟩ + η t (w*_i − 1),    (12)

showing that ⟨H⁻¹e_i, w_t⟩ → +∞. Consequently, since ⟨H⁻¹e_i, w_t⟩ ≤ ‖H⁻¹e_i‖₁ ‖w_t‖∞, we have ‖w_t‖∞ → +∞. Similarly, if w*_i < −1, we show that:

⟨H⁻¹e_i, w_t⟩ ≤ ⟨H⁻¹e_i, w_0⟩ − η t (|w*_i| − 1),    (13)

giving the same conclusion as above.
We now prove Eq. (9). Let us assume that w* ∈ B, i.e. ‖w*‖∞ < 1, and consider the quantity V_t = ½ w_tᵀ H⁻¹ w_t. Using Eq. (8), we have:

V_{t+1} − V_t = −η ⟨w_t, sign(w_t) − w*⟩ + (η²/2) (sign(w_t) − w*)ᵀ H (sign(w_t) − w*),

so that V_{t+1} ≤ V_t as soon as:

⟨w_t, sign(w_t) − w*⟩ ≥ (η/2) (sign(w_t) − w*)ᵀ H (sign(w_t) − w*).    (14)

We want to show that if w_t is large enough in norm ‖·‖∞, Eq. (14) will be met. First note that, the dimension n being finite, ‖sign(w_t)‖₂ ≤ √n and ‖w*‖₂ ≤ √n ‖w*‖∞ ≤ √n. Then, by the triangular inequality:

‖sign(w_t) − w*‖₂ ≤ ‖sign(w_t)‖₂ + ‖w*‖₂ ≤ 2√n.

Denoting (u_i) the eigenbasis of H and (λ_i) the associated eigenvalues, we have, for every x:

xᵀ H x = Σ_i λ_i ⟨x, u_i⟩² ≤ λ_max ‖x‖₂²,

so that:

(sign(w_t) − w*)ᵀ H (sign(w_t) − w*) ≤ 4 n λ_max.    (15)

Thus the right-hand side of Eq. (14) is bounded by 2 η n λ_max. Also note that:

⟨w_t, sign(w_t) − w*⟩ = ‖w_t‖₁ − ⟨w_t, w*⟩ ≥ (1 − ‖w*‖∞) ‖w_t‖₁ ≥ (1 − ‖w*‖∞) ‖w_t‖∞.

So far we have shown that the left-hand side of Eq. (14) is lower bounded by a constant (1 − ‖w*‖∞ > 0) times the infinite norm of w_t, while the right-hand side is bounded. Therefore, to ensure Eq. (14) it suffices that:

(1 − ‖w*‖∞) ‖w_t‖∞ ≥ 2 η n λ_max,

and thus to ensure Eq. (14) it suffices that:

‖w_t‖∞ ≥ 2 η n λ_max / (1 − ‖w*‖∞).

Denoting M = 2 η n λ_max / (1 − ‖w*‖∞), we can conclude that V_{t+1} ≤ V_t whenever ‖w_t‖∞ ≥ M. And because the update, and hence the per-step increase of V_t, is bounded, we get V_t ≤ max(V_0, sup{V(w) : ‖w‖∞ ≤ M} + 2 η² n λ_max); since V(w) ≥ ‖w‖∞² / (2 λ_max), an absolute upper bound of ‖w_t‖∞ follows. Thus we have proven that sup_t ‖w_t‖∞ < +∞.
∎
Lemma 2. Let w_t optimize a quadratic binary task according to the dynamics of Eq. (8) and assume H is diagonal. Then, for every component i such that |w*_i| > 1:

lim_{t→+∞} w_{i,t} / (η t) = H_ii (|w*_i| − 1) sign(w*_i).    (16)

If H is diagonal, the dynamics defined in Eq. (8) simply rewrites component-wise:

w_{i,t+1} = w_{i,t} − η H_ii (sign(w_{i,t}) − w*_i).    (17)
By Lemma 1, applied component-wise since the dynamics decouple, the components such that |w*_i| < 1 remain bounded. For components where |w*_i| > 1, the increment w_{i,t+1} − w_{i,t} has the sign of w*_i, since Eq. (17) rewrites:

w_{i,t+1} − w_{i,t} = η H_ii (w*_i − sign(w_{i,t})),    (18)

with |sign(w_{i,t})| ≤ 1 < |w*_i|, so that w_{i,t} necessarily ends up having the same sign as w*_i; hence there exists t_i such that, for all t ≥ t_i:

sign(w_{i,t}) = sign(w*_i).    (19)
By definition of t_i, sign(w_{i,t}) and w*_i have opposite signs before t_i, so that, for all t < t_i:

w_{i,t+1} − w_{i,t} = η H_ii sign(w*_i) (|w*_i| + 1).    (20)
Therefore, summing Eq. (17) between 0 and t yields, for all t ≥ t_i:

w_{i,t} = w_{i,0} + η H_ii sign(w*_i) [ (|w*_i| + 1) t_i + (|w*_i| − 1) (t − t_i) ],    (21)

so that w_{i,t} / (η t) → H_ii (|w*_i| − 1) sign(w*_i) as t → +∞.
∎
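The asymptotic rate of Lemma 2 can be checked numerically in the diagonal case (a NumPy sketch, assuming the update of Eq. (8) with the notation above):

```python
import numpy as np

# Diagonal quadratic binary task: L(w) = 0.5 * sum_i h[i] * (w[i] - w_star[i])**2
h = np.array([0.5, 1.0, 2.0])            # diagonal of H
w_star = np.array([2.0, -3.0, 0.5])      # |w*_i| > 1 diverges, |w*_i| < 1 stays bounded
eta, T = 0.01, 100000
w = np.zeros(3)
for t in range(T):
    w = w - eta * h * (np.sign(w) - w_star)   # dynamics of Eq. (8)
rate = w / (eta * T)   # should approach h[i] * (|w*_i| - 1) * sign(w*_i)
```

The first two hidden weights grow linearly at the predicted rates (0.5 and -2.0 here), while the third, whose optimum lies inside the unit ball, keeps oscillating around zero.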
Theorem. Let w_t optimize a quadratic binary task according to the dynamics of Eq. (8) and assume H is diagonal. Then, for any component i such that |w*_i| > 1 and any t ≥ t_i, the variation of loss resulting from flipping the binarized weight sign(w_{i,t}) is:

ΔL_i = 2 H_ii |w*_i| > 0.    (22)

In particular, by Lemma 2, ΔL_i = 2 (H_ii + lim_{t→+∞} |w_{i,t}| / (η t)): the faster a hidden weight grows, the greater the loss increase caused by switching its binarized value.

Proof of the Theorem.