1 Introduction
In this paper, we study the level of parameter sharing within the conditional computation framework [1, 2, 3, 10]. The key idea of conditional computation is to make the parameters of the neural network a function of the input x, denoted as θ(x). The computation defined by g is conditioned on x through the function θ:
(1) ŷ = g(x; θ(x))
The conventional neural network has θ(x) = θ₀, a constant function. The network parameter θ₀ is independent of x, and therefore it is shared globally by all examples. To reduce the level of parameter sharing, and thus reduce forgetting, we need to choose a θ(x) other than a constant function.
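The constant-parameter case can be made concrete with a minimal sketch. All names below (`g`, `theta`, `theta_global`) are our own illustrative choices, not an implementation from the paper; `g` is an arbitrary stand-in computation (a single linear layer).

```python
import numpy as np

def g(x, theta):
    """The computation g(x; theta): here a single linear layer (illustrative)."""
    W, b = theta
    return x @ W + b

# Conventional network: theta(x) is a constant function, so every example
# shares (and, when trained, can overwrite) the same global parameters.
rng = np.random.default_rng(0)
theta_global = (rng.normal(size=(4, 2)), np.zeros(2))

def theta(x):
    # theta(x) = theta_0 regardless of the input x
    return theta_global

x = rng.normal(size=4)
y = g(x, theta(x))  # Eqn. 1: the output is conditioned on x through theta(x)
```

Because `theta` returns the same object for every input, any gradient update made while learning one example necessarily moves the parameters used by all others.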
A naïve choice of θ(x) to prevent forgetting is a lookup table LUT that keeps different parameters for every unique input x:
(2) θ(x) = LUT[x]
With this one-to-one mapping between x and θ(x), there is zero parameter sharing between examples, and thus no forgetting happens when learning new examples. However, this choice is not very interesting, as it results in a local nonparametric model, losing the generalization property of neural networks.
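The lookup-table extreme can be sketched in a few lines. This is an illustrative toy, not the paper's method; the names (`lut`, `theta`, `learn`) and the choice of storing the target directly as the "parameters" are ours.

```python
import numpy as np

# One-to-one mapping (Eqn. 2): a lookup table keeps separate parameters
# for every unique input. Zero sharing means zero forgetting, but also
# no generalization to unseen inputs.
lut = {}

def theta(x):
    key = x.tobytes()           # hashable key for the raw input
    if key not in lut:
        lut[key] = np.zeros(1)  # fresh parameters for a never-seen input
    return lut[key]

def learn(x, target):
    # "Training" on x touches only x's own entry in the table,
    # so the stored parameters of every other example are untouched.
    lut[x.tobytes()] = np.array([target])

x1, x2 = np.array([1.0, 2.0]), np.array([3.0, 4.0])
learn(x1, 1.0)
learn(x2, 0.0)  # learning x2 cannot interfere with x1's entry
```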
Given neural networks and local nonparametric models as two extremes in the conditional computation framework, we seek to strike a balance between those two extremes in the hope that the resulting model can take the best of both worlds, i.e. having good generalization like neural networks while suffering less from catastrophic forgetting like local nonparametric models.
The contributions of this work are:

We analyze parameter sharing and correspondingly the interfered examples when learning new knowledge under the conditional computation framework.

Based on the analysis of interfered examples, we propose conditional rehearsal, which rehearses only the interfered examples and is therefore more efficient than random rehearsal [9].

We introduce clipped maxout, which has a smaller set of interfered examples when learning a new example, compared to maxout. Together with conditional rehearsal, it is capable of continual learning.

We also evaluate our proposed method in a new MNIST setup, named MNIST-ol, where a single example is used for training at a time and the distribution of the received examples shifts over time.
2 Conditional Computation with Partial Parameter Sharing
2.1 Many-to-One Mapping between x and θ
Consider first how one could share parameters within groups of examples using
(3) θ(x) = LUT[G(x)]
We map the examples to a group id through the grouping function G, and then associate unique parameters with each group through a lookup table. In contrast to the one-to-one mapping defined in Eqn. 2, Eqn. 3 defines a many-to-one mapping between examples and parameters: parameters are shared between examples that map to the same group. To reduce complication, we assume that the grouping function G is predefined, so that the parameter sharing relationships between examples are fixed; the situation where G changes will be revisited in Sec. 2.2.1. Under this setting, learning a new example interferes only with historical examples from the same group as the new example.
Additionally, it is computationally more efficient to do rehearsal, since we only need to rehearse the limited number of examples that are interfered with, i.e., those in the same group as the new example. We term this conditional rehearsal, because the rehearsal set is conditioned on the new example being learned. Note that in this work we assume all historical examples are available, and that the examples can be stored in a lookup table indexed by group id for fast retrieval.
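The grouped storage and retrieval described above can be sketched as follows. This is an illustrative sketch under our own assumptions: `G` is an arbitrary toy grouping function, and the per-group parameter table is shown but left untrained.

```python
from collections import defaultdict
import numpy as np

# Many-to-one mapping (Eqn. 3) with a predefined grouping function G.
# Parameters and historical examples are both indexed by group id, so
# retrieving the conditional rehearsal set is a single table lookup.
def G(x):
    return 0 if x[0] >= 0 else 1          # toy grouping on the first feature

params = defaultdict(lambda: np.zeros(2))  # LUT[group id] -> group parameters
history = defaultdict(list)                # group id -> stored examples

def observe(x):
    history[G(x)].append(x)

def rehearsal_set(x_new):
    # Conditional rehearsal: only same-group examples share parameters
    # with x_new, so only they can be interfered with and need rehearsing.
    return history[G(x_new)]

observe(np.array([1.0, 0.0]))
observe(np.array([-1.0, 0.0]))
```

Because `G` is fixed, the sharing structure never drifts during training, which is exactly the assumption relaxed in Sec. 2.2.1.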
2.2 Many-to-Many Mapping between x and θ
The many-to-one mapping assumes that examples belonging to different groups cannot share parameters, which may be too restrictive and loses the combinatorial advantage of sharing enjoyed by deep neural networks [8]. We can easily extend Eqn. 3 to a many-to-many mapping by letting the grouping function assign more than one group id to the input x:
(4) θ(x) = { LUT[i] : i ∈ G(x) }
2.2.1 Maxout Network as Conditional Computation
One empirical argument is that we can reduce forgetting if we sparsify the updates to the weights by means of node sharpening [4], dropout [5], maxout [6], or compete-to-compute [11]. Although these methods do empirically exhibit slower forgetting, the results are still far from satisfactory [5]. Here we analyze the forgetting property of maxout networks under the conditional computation framework and introduce a few modifications to further reduce forgetting.
A maxout unit implements the following function:
(5) h_i(x) = max_{j ∈ [1,k]} z_ij(x),  where z_ij(x) = x^T W_{·ij} + b_ij
where W ∈ ℝ^{d×m×k} and b ∈ ℝ^{m×k} are the parameters, so that there are m outputs and each output is the maximum over k linear neurons. It can be transformed into the conditional computation form:
(6) h_i(x) = x^T W_{·i j*_i(x)} + b_{i j*_i(x)}
(7) j*_i(x) = argmax_{j ∈ [1,k]} ( x^T W_{·ij} + b_ij )
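Eqns. 5–7 can be checked numerically with a small NumPy sketch; the shapes follow the (d, m, k) parameterization above, and the function names are ours.

```python
import numpy as np

def maxout(x, W, b):
    """Maxout (Eqn. 5): m outputs, each the max over k linear neurons.
    W has shape (d, m, k), b has shape (m, k)."""
    z = np.einsum('d,dmk->mk', x, W) + b   # responses of all m*k linear neurons
    return z.max(axis=-1)

def selected_neuron(x, W, b):
    """Conditional-computation view (Eqn. 7): the index j*(x) of the
    winning linear neuron for each output unit."""
    z = np.einsum('d,dmk->mk', x, W) + b
    return z.argmax(axis=-1)

rng = np.random.default_rng(0)
d, m, k = 3, 2, 4
W, b = rng.normal(size=(d, m, k)), rng.normal(size=(m, k))
x = rng.normal(size=d)

# Eqn. 6: the unit's output equals the selected linear neuron's response.
z = np.einsum('d,dmk->mk', x, W) + b
j_star = selected_neuron(x, W, b)
assert np.allclose(maxout(x, W, b), z[np.arange(m), j_star])
```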
We use H to represent the set of all historical examples, and I(x_new) for the examples interfered with when learning a new example x_new. Taking Eqn. 6 alone, if j*(·) were predefined and fixed, learning x_new would only interfere with examples selecting the same neuron, i.e., I(x_new) = {x ∈ H : j*(x) = j*(x_new)}. However, the assumption that j*(·) is fixed does not hold, because j*(·) itself uses W and b as parameters. When W and b are updated, j*(x), and thus h(x), is potentially massively modified for any x ∈ H, including when j*(x) ≠ j*(x_new). This can be demonstrated with the 1D case in Fig. 1. For simplicity, we omit the output index i here and for the rest of the paper when discussing a single maxout unit with scalar output. In this figure, we show that although j*(x) ≠ j*(x_new), the output value at x is still changed when the linear neuron j*(x_new) gets updated. Note that if the change to the linear neuron j*(x_new) is small enough, there is a nonzero probability that j*(x) does not change. This could be the reason why maxout empirically mitigates forgetting slightly, as shown in [5], but there is no theoretical guarantee. Therefore, in the worst case of linear maxout, the interfered set I(x_new) equals H. To make I(x_new) strictly smaller than H, we introduce a few modifications to maxout in the next section.
2.2.2 Clipped Maxout and Conditional Rehearsal
We clip the linear output before the maxout at a constant c, replacing z_j(x) with min(z_j(x), c). The value of c can be any fixed constant, because we can rely on the bias term to control the magnitude of the output; we use c = 0 for clipping in this paper. The clipped maxout is described by the following function:
(8) h(x) = max_{j ∈ [1,k]} z̄_j(x)
(9) z̄_j(x) = min( w_j^T x + b_j, 0 )
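A minimal sketch of Eqns. 8–9 for a single scalar-output unit (the (d, k) weight layout and function name are ours):

```python
import numpy as np

def clipped_maxout(x, W, b, c=0.0):
    """Maximally clipped maxout (Eqns. 8-9): each linear neuron is
    clipped from above at c (here c = 0) before the max is taken."""
    z = x @ W + b               # W: (d, k), b: (k,) -- one scalar unit
    return np.minimum(z, c).max()

rng = np.random.default_rng(1)
W, b = rng.normal(size=(3, 5)), rng.normal(size=(5,))
x = rng.normal(size=3)
h = clipped_maxout(x, W, b)
assert h <= 0.0                 # the output can never exceed the clip value
```

With c = 0 the unit saturates at 0 as soon as any one linear neuron is non-negative; this flat region is what later limits which examples an update can disturb.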
Conditioned on the new example x_new, with j* = j*(x_new) the neuron it selects, the historical example set H can be divided into 3 disjoint sets based on which neurons are clipped (we say neuron j is clipped at x when z_j(x) ≥ 0):

- S1: examples at which no neuron is clipped;

- S2: examples at which j* is the only clipped neuron;

- S3: examples at which at least one neuron other than j* is clipped.
We show a 2D graphical depiction of the 3 sets in Fig. 2: S1 (no neuron clipped) in pink, S2 (only the selected neuron clipped) in yellow, and S3 (some other neuron clipped) in light blue. The dashed red line stands for the linear neuron selected by x_new; it will be updated when we perform one step of gradient descent on x_new. When the update happens, only examples falling in S1 and S2 will be interfered with. Examples in S3 will not be interfered with, because they are clipped on at least one neuron that is not updated; refer to Sec. G of the Appendix for a formal proof. One good property of the clipped maxout is that the interfered examples fall within the convex set enclosed by the linear neurons, excluding the neuron being updated. This convex set can potentially be small if enough neurons are maxed out.
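The three-way partition can be sketched directly from the clipping criterion (z_j(x) ≥ 0). The function name and the brute-force loop are our own illustrative choices; a real implementation would use the bookkeeping structure of Appendix H instead of rescanning all of history.

```python
import numpy as np

def partition(H, W, b, j_star):
    """Split historical examples by clipping status. Only S1 and S2 can be
    interfered with when neuron j_star is updated; S3 is provably safe."""
    S1, S2, S3 = [], [], []
    for x in H:
        clipped = set(np.flatnonzero(x @ W + b >= 0.0))
        if not clipped:
            S1.append(x)            # inside the convex set: no neuron clipped
        elif clipped == {j_star}:
            S2.append(x)            # clipped only on the neuron being updated
        else:
            S3.append(x)            # clipped on some other, untouched neuron
    return S1, S2, S3

rng = np.random.default_rng(2)
W, b = rng.normal(size=(2, 4)), rng.normal(size=(4,))
H = [rng.normal(size=2) for _ in range(20)]
S1, S2, S3 = partition(H, W, b, j_star=0)
assert len(S1) + len(S2) + len(S3) == len(H)   # the sets are a partition
```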
Given that training on the new example only interferes with S1 and S2, we can use conditional rehearsal to specifically rehearse these examples when learning new knowledge. If the model has enough capacity to learn the new knowledge and at the same time preserve the outputs for the rehearsed examples, it is guaranteed that no historical examples will be forgotten. The effectiveness of conditional rehearsal on clipped maxout units is verified in the experiments section.
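One continual-learning step with conditional rehearsal can be sketched as follows. This is an illustrative sketch under our own assumptions: a single clipped-maxout unit, a squared-error surrogate loss (the paper's experiments use a sigmoid cross-entropy loss), plain gradient steps, and hypothetical function names.

```python
import numpy as np

def step_on(x, y, W, b, lr):
    """One gradient step of a squared-error surrogate on one example.
    Gradient flows only through the winning, unclipped neuron, matching
    the clipped-maxout forward pass max_j min(z_j, 0)."""
    z = x @ W + b
    j = int(np.argmax(np.minimum(z, 0.0)))
    if z[j] >= 0.0:              # some neuron is clipped: output is flat at 0
        return
    grad = 2.0 * (z[j] - y)
    W[:, j] -= lr * grad * x     # only the selected neuron moves
    b[j] -= lr * grad

def learn_with_conditional_rehearsal(x_new, y_new, history, W, b, lr=0.05):
    z_new = x_new @ W + b
    j_star = int(np.argmax(np.minimum(z_new, 0.0)))
    # Interfered set S1 ∪ S2: examples clipped on no neuron other than j_star.
    interfered = []
    for x, y in history:
        z = x @ W + b
        if not any(j != j_star and z[j] >= 0.0 for j in range(b.size)):
            interfered.append((x, y))
    step_on(x_new, y_new, W, b, lr)
    for x, y in interfered:      # rehearse only what the update can disturb
        step_on(x, y, W, b, lr)
    return len(interfered)

rng = np.random.default_rng(0)
W, b = rng.normal(size=(2, 4)), rng.normal(size=(4,))
history = [(rng.normal(size=2), -1.0) for _ in range(5)]
n = learn_with_conditional_rehearsal(rng.normal(size=2), -1.0, history, W, b)
```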
2.2.3 Minimally Clipped Minout
To make the activation value positive rather than negative inside the convex set, we adopt the mirrored negative of the maximally clipped maxout, which is the minimally clipped minout. The definition and motivation of the minimally clipped minout are detailed in Sec. I of the Appendix.
3 Experiments
Data — We experiment on the MNIST dataset in this work.
Setups — Disjoint MNIST [11] and Permuted MNIST [5, 7] are the most commonly used settings. Disjoint MNIST splits the dataset into multiple subsets with disjoint labels. Permuted MNIST creates new datasets from MNIST by permuting the pixels. For these two setups, the algorithm is trained on one subset at a time, with an i.i.d. assumption within each subset.
In this work, however, we study continual learning in an online nonstationary setting, where a single example at a time is seen before making an update, and the distribution of the received examples shifts over time. The goal is to fit optimally to the already seen examples at any point during training, which can be measured by the accuracy on the test set throughout the training procedure. Accordingly, we propose a new setup for the MNIST dataset named MNIST with ordered labels (MNIST-ol). As the name implies, the training images are arranged by their associated labels in ascending (or descending) order; for example, during training, images with label 0 are received first and those with label 9 are received last. Ordering by labels removes the assumption that each example from the data stream is drawn i.i.d. from the whole training set. It can be seen from Fig. 7 in the Appendix that learning with stochastic gradient descent completely fails on MNIST-ol.
Due to a currently suboptimal implementation, we experiment with a subset of MNIST-ol, built by randomly taking 100 examples from each class.
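The MNIST-ol ordering and per-class subsampling can be sketched as below. The function name and the synthetic stand-in data are ours; the real experiments use the MNIST images themselves.

```python
import numpy as np

def make_ordered_stream(images, labels, per_class=100, seed=0):
    """Arrange a labeled dataset into the MNIST-ol order: examples sorted
    by label ascending, keeping `per_class` random examples per class."""
    rng = np.random.default_rng(seed)
    stream = []
    for c in np.unique(labels):                    # labels visited in ascending order
        idx = np.flatnonzero(labels == c)
        idx = rng.permutation(idx)[:per_class]
        stream.extend((images[i], int(c)) for i in idx)
    return stream                                  # deliberately NOT shuffled across classes

# Toy stand-in data with 10 classes of 100 examples each.
X = np.zeros((1000, 4))
y = np.repeat(np.arange(10), 100)
stream = make_ordered_stream(X, y, per_class=10)
labels_seen = [c for _, c in stream]
assert labels_seen == sorted(labels_seen)          # labels arrive in ascending order
```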
Model configuration — Our model is a single-layer minout network with 10 minout units, each corresponding to one label. Each minout unit has 50 linear neurons. We apply a sigmoid activation function to the linear neurons before the minout, so that their outputs lie in (0, 1). The output of each minout unit is directly used as the probability of the corresponding label and trained with a per-label sigmoid cross-entropy loss. An activation value smaller than a small threshold is treated as clipped to 0 when deciding the interfered set.
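The forward pass of this configuration can be sketched as follows, using the sizes stated above (10 units, 50 neurons each, 784-dimensional MNIST inputs). The variable names and the tiny random initialization are our own assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_units, n_neurons, d = 10, 50, 784     # sizes from the model configuration
W = rng.normal(scale=0.01, size=(d, n_units, n_neurons))
b = np.zeros((n_units, n_neurons))

def forward(x):
    """Single-layer sigmoid minout: each unit outputs the minimum of its
    50 sigmoid-squashed linear neurons, used directly as a label probability."""
    z = np.einsum('d,dun->un', x, W) + b
    s = 1.0 / (1.0 + np.exp(-z))        # squash each neuron into (0, 1)
    return s.min(axis=-1)               # minout per unit -> one value per label

p = forward(rng.normal(size=d))
assert p.shape == (n_units,)            # one probability per label
```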
Baselines — We compare to the same model trained on MNIST-ol without rehearsal, and with rehearsal on randomly selected historical examples. The number of randomly selected examples is set to match the number of conditionally rehearsed examples in the studied method, which is around 100 according to Sec. 3.1.
Training — Training proceeds one example after another, with an additional rehearsal loss and corresponding updates. For both our method and the baselines, the training of an example on one minout unit is stopped as soon as the loss on this unit falls below a fixed threshold.
3.1 Number of Examples Rehearsed
Under the configuration of our model, each minout unit encloses one class of the training data in its convex set, which suggests that the theoretical number of rehearsed examples should be around 100. To verify this, we plot the average number of examples rehearsed per minout unit during training in Fig. 3. The number of rehearsed examples throughout training fluctuates around 100, consistent with our expectations. More discussion on the number of rehearsed examples can be found in Appendix K.
3.2 Accuracy and Forgetting Behavior of the Proposed Method
We measure the accuracy on both the training set and the test set after learning every example, and plot the training/testing accuracy in Fig. 4. Both training and testing accuracy increase monotonically throughout training for clipped minout with conditional rehearsal. Accuracy on the training set reaches 100% at the end of training, which means no forgetting happens.
The no-rehearsal baseline fails to learn, as expected. However, the random rehearsal baseline does about as well as conditional rehearsal. We argue that this is because the MNIST dataset has only a few modes, so 100 randomly selected examples contain enough information about the whole dataset. For a more complex dataset where the number of modes exceeds the number of rehearsed examples, conditional rehearsal would be advantageous because it is more selective and thus more efficient. We verify this by further reducing the training set to 10 examples per class; correspondingly, the number of rehearsal examples is reduced to 10. It is harder for 10 examples to contain sufficient information about the whole dataset, and as shown in Fig. 5, conditional rehearsal outperforms random rehearsal by a large margin.
4 Future Directions
We will focus on two directions in the future. First, we aim to develop a deep version of the proposed clipped maxout network. Second, we plan to design connectionist approaches for storing historical data.
References
[1] Emmanuel Bengio, Pierre-Luc Bacon, Joelle Pineau, and Doina Precup. Conditional computation in neural networks for faster models. arXiv preprint arXiv:1511.06297, 2015.
[2] Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.
[3] Kyunghyun Cho and Yoshua Bengio. Exponentially increasing the capacity-to-computation ratio for conditional computation in deep learning. arXiv preprint arXiv:1406.7362, 2014.
[4] Robert M. French. Using semi-distributed representations to overcome catastrophic forgetting in connectionist networks. In Proceedings of the 13th Annual Cognitive Science Society Conference, pages 173–178. Erlbaum, 1991.
[5] Ian J. Goodfellow, Mehdi Mirza, Da Xiao, Aaron Courville, and Yoshua Bengio. An empirical investigation of catastrophic forgetting in gradient-based neural networks. arXiv e-prints, December 2013.
[6] Ian J. Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron Courville, and Yoshua Bengio. Maxout networks. In Proceedings of the 30th International Conference on Machine Learning, pages 1319–1327. JMLR.org, 2013.
[7] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, page 201611835, 2017.
[8] Guido F. Montufar, Razvan Pascanu, Kyunghyun Cho, and Yoshua Bengio. On the number of linear regions of deep neural networks. In Advances in Neural Information Processing Systems, pages 2924–2932, 2014.
[9] Anthony Robins. Catastrophic forgetting, rehearsal and pseudorehearsal. Connection Science, 7(2):123–146, 1995.
[10] Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538, 2017.
[11] Rupesh K. Srivastava, Jonathan Masci, Sohrob Kazerounian, Faustino Gomez, and Jürgen Schmidhuber. Compete to compute. In Advances in Neural Information Processing Systems, pages 2310–2318, 2013.
Appendix G One-step update on x_new does not interfere with examples in S3
Denote the parameters after updating on x_new as W′ and b′, where w′_j = w_j and b′_j = b_j for all j ≠ j* = j*(x_new).
Theorem 1.
Let h′ denote the clipped maxout function after the update; then h′(x) = h(x) for all x ∈ S3.
Proof. By the definition of S3, for x ∈ S3 there exists j′ ≠ j* where z_{j′}(x) ≥ 0, and thus z̄_{j′}(x) = min(z_{j′}(x), 0) = 0. For any j, z̄_j(x) = min(z_j(x), 0) ≤ 0.
Therefore, h(x) = max_j z̄_j(x) = z̄_{j′}(x) = 0.
Let z̄′_j denote the clipped linear outputs after the update. Since j′ ≠ j*, neuron j′ is not updated, so z̄′_{j′}(x) = z̄_{j′}(x) = 0, while z̄′_j(x) ≤ 0 for all j.
Therefore, h′(x) = max_j z̄′_j(x) = 0 = h(x), which completes the proof.
Appendix H Bookkeeping for Conditional Rehearsal
To efficiently locate the interfered historical examples when training on a new example, we use a key-value store for the historical data. The keys are indices of linear neurons, and an example is stored under a key if the corresponding neuron is clipped at this example. At the same time, a counter associated with each example records how many linear neurons are clipped at it. For a historical example, if no neurons are clipped, it falls in S1; if the only clipped neuron is the one selected by the new example, it falls in S2; otherwise, it falls in S3. After each weight update, the information in the table is updated accordingly.
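The bookkeeping above can be sketched with a small index class. The class and method names are our own; the point is that querying the interfered set S1 ∪ S2 requires no scan over clipped examples stored under other neurons.

```python
from collections import defaultdict

class RehearsalIndex:
    """Illustrative key-value bookkeeping: for each linear neuron, the ids
    of historical examples clipped at it, plus a per-example count of
    clipped neurons, so the interfered set is found by direct lookup."""
    def __init__(self):
        self.by_neuron = defaultdict(set)  # neuron j -> ids of examples clipped at j
        self.clip_count = {}               # example id -> number of clipped neurons

    def add(self, ex_id, clipped_neurons):
        self.clip_count[ex_id] = len(clipped_neurons)
        for j in clipped_neurons:
            self.by_neuron[j].add(ex_id)

    def interfered(self, j_star):
        # S1: clipped nowhere; S2: clipped only at the updated neuron j_star.
        s1 = {e for e, c in self.clip_count.items() if c == 0}
        s2 = {e for e in self.by_neuron[j_star] if self.clip_count[e] == 1}
        return s1 | s2

idx = RehearsalIndex()
idx.add("a", [])        # clipped at no neuron            -> S1
idx.add("b", [3])       # clipped only at neuron 3        -> S2 for j_star = 3
idx.add("c", [3, 5])    # clipped at another neuron too   -> S3, safe
assert idx.interfered(3) == {"a", "b"}
```

In a full implementation the index would also be refreshed for the rehearsed examples after each weight update, since their clipping status can change.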
Appendix I Minimally Clipped Minout
The minimally clipped minout unit is defined as follows:
(16) h(x) = min_{j ∈ [1,k]} z̃_j(x)
(17) z̃_j(x) = max( w_j^T x + b_j, 0 )
We can rewrite Eqn. 8 and Eqn. 9 as the negative of such a minout, with −W and −b as the parameters:
(18) h(x) = − min_{j ∈ [1,k]} z̃_j(x)
(19) z̃_j(x) = max( −w_j^T x − b_j, 0 )
In this work, we adopt the minimally clipped minout instead of the maximally clipped maxout because it is activated (larger than 0) rather than deactivated (smaller than 0) inside the convex set. This aligns better with human intuition: when we think of the unit as a detector of some property of the input data, we would like the unit to be activated when the property is present.
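The mirror identity behind Eqns. 18–19 can be checked numerically; the variable names are ours and the sizes are arbitrary.

```python
import numpy as np

# Check: maximally clipped maxout with parameters (W, b) equals the
# negative of minimally clipped minout with parameters (-W, -b),
# since min(z, 0) = -max(-z, 0) and max_j(-a_j) = -min_j(a_j).
rng = np.random.default_rng(3)
W, b = rng.normal(size=(3, 5)), rng.normal(size=(5,))
x = rng.normal(size=3)

maxout_clipped = np.minimum(x @ W + b, 0.0).max()        # Eqns. 8-9
minout_clipped = np.maximum(x @ (-W) + (-b), 0.0).min()  # Eqns. 16-17 with -W, -b
assert np.isclose(maxout_clipped, -minout_clipped)
```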
Appendix J Relationship between Minout and the Gating Mechanism for Conditional Computation
The original conditional computation paper and follow-up works introduce binary gating neurons that turn computing neurons on or off conditioned on the input [2]. In practice, sigmoid activation functions are usually used in place of the binarization for ease of training. Assuming both the computing neuron and the gating neuron use sigmoid activation functions, the gated unit can be written as:
(20) h(x) = σ(w_1^T x + b_1) · σ(w_2^T x + b_2)
Note that the computing neuron and the gating neuron are indistinguishable and can be swapped. We can generalize this to h(x) = ∏_{j ∈ [1,k]} σ(w_j^T x + b_j). We can see that h(x) ≈ 0 if there exists a j such that σ(w_j^T x + b_j) ≈ 0, and h(x) ≈ 1 only when σ(w_j^T x + b_j) ≈ 1 for all j.
For the sigmoid-clipped minout used in our experiments:
(21) h(x) = min_{j ∈ [1,k]} σ(w_j^T x + b_j)
It has a similar behavior: h(x) ≈ 1 only when all of the σ(w_j^T x + b_j) ≈ 1.
In fact, Eqn. 20 and Eqn. 21 can both be seen as AND functions, whose output is high only when all the neurons are activated, i.e., only when all conditions are satisfied. This means that if the minout function is properly trained, only very specific examples will fall inside the convex set, keeping the rehearsal set small.
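The AND-like behavior of both forms can be demonstrated with a short numerical sketch (function names and test values are ours):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated(zs):
    # Gated unit (generalized Eqn. 20): product of sigmoids, a soft AND.
    return float(np.prod(sigmoid(zs)))

def minout(zs):
    # Sigmoid-clipped minout (Eqn. 21): min of sigmoids, also a soft AND.
    return float(np.min(sigmoid(zs)))

zs_all_on = np.array([6.0, 7.0, 8.0])    # every neuron strongly activated
zs_one_off = np.array([6.0, 7.0, -8.0])  # a single "off" neuron vetoes the unit

assert gated(zs_all_on) > 0.99 and minout(zs_all_on) > 0.99
assert gated(zs_one_off) < 0.01 and minout(zs_one_off) < 0.01
```

Both functions are near 1 only when every pre-activation is strongly positive, which is the "very specific examples" property exploited for a small rehearsal set.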
Appendix K Number of Rehearsed Examples
One can consider a maxout unit as a detector of some property of the input. Inputs that have this property fall within the convex set; we call them positive examples. Inputs that do not have this property fall outside of the convex set; we call them negative examples.
For a single maxout unit, the number of rehearsed examples depends on the sizes of S1 and S2. S2 can be small if most negative examples are clipped at more than one neuron. The size of S1 depends on how many historical examples activate the maxout unit. In this paper, since each maxout unit corresponds directly to one of the categories, the size of S1 is approximately one tenth (for the ten categories in MNIST) of the total training examples.
In a future deep version of this idea, we can have more maxout units in the hidden layers, and we can study the relationship between the number of rehearsed examples and the number of maxout units.
Appendix L Stochastic Gradient Descent Fails on MNIST-ol
A 2-layer multilayer perceptron with ReLU activations and a softmax loss is constructed and trained with SGD on MNIST-ol. It is compared against online MNIST under the i.i.d. assumption, i.e., SGD with a minibatch size of 1.
Fig. 7 shows that SGD fails utterly on MNIST-ol, where the i.i.d. assumption is broken by the ordering.