Conditional Computation for Continual Learning

06/16/2019 ∙ by Min Lin, et al. ∙ 2

Catastrophic forgetting of connectionist neural networks is caused by the global sharing of parameters among all training examples. In this study, we analyze parameter sharing under the conditional computation framework where the parameters of a neural network are conditioned on each input example. At one extreme, if each input example uses a disjoint set of parameters, there is no sharing of parameters thus no catastrophic forgetting. At the other extreme, if the parameters are the same for every example, it reduces to the conventional neural network. We then introduce a clipped version of maxout networks which lies in the middle, i.e. parameters are shared partially among examples. Based on the parameter sharing analysis, we can locate a limited set of examples that are interfered when learning a new example. We propose to perform rehearsal on this set to prevent forgetting, which is termed as conditional rehearsal. Finally, we demonstrate the effectiveness of the proposed method in an online non-stationary setup, where updates are made after each new example and the distribution of the received example shifts over time.


1 Introduction

In this paper, we study the level of parameter sharing within the conditional computation framework Bengio2013 ; Cho2014 ; Bengio2015 ; Shazeer2017 . The key idea of conditional computation is to make the parameters $\theta$ of the neural network a function of the input $x$, denoted as $\theta = g(x)$. The computation defined by $f$ is conditioned on $x$ through the function $g$:

$$y = f(x;\, \theta), \qquad \theta = g(x) \tag{1}$$

The conventional neural network has a constant $g$. The network parameter $\theta$ is independent of $x$, and therefore it is shared globally by all examples. To reduce the level of parameter sharing and thus reduce forgetting, we need to choose $g$ other than constant functions.

A naïve choice of $g$ to prevent forgetting is a look-up table that keeps different parameters for every unique input $x$:

$$g(x) = \theta_x \tag{2}$$

With this one-to-one mapping between $x$ and $\theta$, there is zero parameter sharing between examples, and thus no forgetting will happen when learning new examples. However, this choice is not very interesting as it results in a local non-parametric $f$, losing the generalization property of neural networks.

Given neural networks and local non-parametric models as two extremes in the conditional computation framework, we seek to strike a balance between those two extremes in the hope that the resulting model can take the best of both worlds, i.e. having good generalization like neural networks while suffering less from catastrophic forgetting like local non-parametric models.

The contributions of this work are:

  • We analyze parameter sharing and correspondingly the interfered examples when learning new knowledge under the conditional computation framework.

  • Based on the analysis of interfered examples, we propose conditional rehearsal to rehearse only the interfered examples, which is more efficient than random rehearsal Robins1995 .

  • We introduce clipped maxout, which has a smaller set of interfered examples when learning a new example, compared to maxout. Together with conditional rehearsal, it is capable of continual learning.

  • We also evaluate our proposed method in a new setup of MNIST, named MNIST-ol, where a single example is used for training at a time, and the distribution of the received example shifts over time.

2 Conditional Computation with Partial Parameter Sharing

2.1 Many-to-One Mapping between $x$ and $\theta$

Consider first how one could share parameters within groups of examples using

$$g(x) = \theta_{h(x)} \tag{3}$$

We would map the examples to a group id through the grouping function $h$ and then associate unique parameters to each group through a look-up table. In contrast to the one-to-one mapping defined in Eqn. 2, in Eqn. 3 we define a many-to-one mapping between examples and parameters. Parameters are shared between examples mapping to the same group. For simplicity, we assume that the grouping function $h$ is pre-defined so that the parameter sharing relationships between examples are fixed. The situation where $h$ changes will be revisited in Sec. 2.2.1. Under this setting, learning a new example interferes only with historical examples from the same group as the new example.

Additionally, rehearsal is computationally more efficient since we only need to rehearse over the limited number of examples that are interfered, i.e. those in the same group as the new example. We term this conditional rehearsal because the rehearsal set is conditioned on the new example being learned. Note that in this work we assume all historical examples are available, and the examples can be stored in a look-up table indexed by $h(x)$ for fast retrieval.
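The many-to-one setting can be illustrated with a toy sketch. Everything below is an assumption made for illustration only: the bucketing function `group_id`, the one-scalar-per-group model, and the squared-error update are hypothetical and are not taken from the paper.

```python
from collections import defaultdict

def group_id(x, num_groups=4):
    """Hypothetical grouping function: bucket a scalar input in [0, 1)."""
    return min(int(x * num_groups), num_groups - 1)

class GroupedModel:
    """One scalar parameter per group; other groups never share it."""

    def __init__(self):
        self.theta = defaultdict(float)   # look-up table: group id -> parameter
        self.memory = defaultdict(list)   # historical examples indexed by group id

    def predict(self, x):
        return self.theta[group_id(x)]

    def rehearsal_set(self, x):
        # Only historical examples in the same group can be interfered with.
        return list(self.memory[group_id(x)])

    def learn(self, x, y, lr=0.5, steps=50):
        g = group_id(x)
        batch = self.rehearsal_set(x) + [(x, y)]
        for _ in range(steps):
            # Gradient of the mean squared error over new + rehearsed examples.
            grad = sum(self.theta[g] - target for _, target in batch) / len(batch)
            self.theta[g] -= lr * grad
        self.memory[g].append((x, y))

model = GroupedModel()
model.learn(0.1, 1.0)    # falls in group 0
model.learn(0.9, -1.0)   # falls in group 3, leaves group 0 untouched
```

Because the second example maps to a different group, its rehearsal set is empty and the parameter learned for the first example is not modified.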

2.2 Many-to-Many Mapping between $x$ and $\theta$

The many-to-one mapping assumes that examples belonging to different groups cannot share parameters, which may be too restrictive and loses the combinatorial advantage of sharing enjoyed by deep neural networks (montufar2014number). We can easily extend Eqn. 3 to a many-to-many mapping by assigning more than one group id, $h_1(x), h_2(x), \dots$, to the input $x$:

$$g(x) = \{\theta_{h_1(x)}, \theta_{h_2(x)}, \dots\} \tag{4}$$

2.2.1 Maxout Network as Conditional Computation

One empirical argument is that we can reduce forgetting if we sparsify the update to the weights by means of node sharpening French1991 , dropout Goodfellow2013 , maxout goodfellow2013maxout or compete to compute Srivastava2013 . Although empirically they do exhibit a slower forgetting property, the results are still far from satisfactory Goodfellow2013 . Here we analyze the forgetting property of maxout networks under the conditional computation framework and introduce a few modifications to further reduce forgetting.

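A maxout unit implements the following function:

$$f_i(x) = \max_{j \in [1, k]} \left( x^\top W_{\cdot ij} + b_{ij} \right) \tag{5}$$

where $W \in \mathbb{R}^{d \times m \times k}$ and $b \in \mathbb{R}^{m \times k}$ are the parameters so that there are $m$ outputs and each output is the maximum over $k$ neurons. It can be transformed into the conditional computation form:

$$f_i(x) = x^\top W_{\cdot i g_i(x)} + b_{i g_i(x)} \tag{6}$$
$$g_i(x) = \arg\max_{j \in [1, k]} \left( x^\top W_{\cdot ij} + b_{ij} \right) \tag{7}$$

We refer to the set of all historical examples as the historical set, and to the examples affected by learning a new example as the interfered set. Take Eqn. 6 alone: if $g_i$ is pre-defined and fixed, learning a new example only interferes with the historical examples that select the same linear neuron as the new example. However, the assumption that $g_i$ is fixed does not hold because $g_i$ itself uses $W$ and $b$ as parameters. When $W$ and $b$ are updated, $g_i(x)$ and thus $f_i(x)$ is potentially massively modified for any $x$, including examples that select a different neuron from the new example. This can be demonstrated with the 1D case in Fig. 1. For simplicity, we omit the index $i$ here and for the rest of the paper when discussing only one maxout unit with scalar output. In this figure, we show that although a historical example selects a different linear neuron from the new example, its output value is still changed when the linear neuron selected by the new example gets updated. Note that if the change to that linear neuron is small enough, there is some probability that $g(x)$ does not change. This could be the reason why empirically maxout slightly mitigates the forgetting as shown in Goodfellow2013 , but there is no theoretical guarantee.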
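The interference effect can be reproduced with a few lines of code. The following is a minimal sketch, assuming a single maxout unit with two 1D linear neurons; the weights, the historical example `x_old`, and the bias change standing in for a gradient step are all made up for illustration.

```python
def linear_outputs(x, W, b):
    """z_j = x . W[j] + b[j] for the k linear neurons of one maxout unit."""
    return [sum(xi * wi for xi, wi in zip(x, w)) + bj for w, bj in zip(W, b)]

def select(x, W, b):
    """Index of the linear neuron that attains the maximum."""
    z = linear_outputs(x, W, b)
    return max(range(len(z)), key=lambda j: z[j])

def maxout(x, W, b):
    """Conditional form: evaluate only the selected linear neuron."""
    j = select(x, W, b)
    return sum(xi * wi for xi, wi in zip(x, W[j])) + b[j]

# Two 1D linear neurons: z_0 = x and z_1 = -x.
W, b = [[1.0], [-1.0]], [0.0, 0.0]
x_old = [-1.0]                       # historical example, selects neuron 1
before = (select(x_old, W, b), maxout(x_old, W, b))

# Stand-in for learning a new example: only neuron 0 is changed.
b_updated = [3.0, 0.0]
after = (select(x_old, W, b_updated), maxout(x_old, W, b_updated))
```

Although `x_old` did not select neuron 0 before the update, both its selected neuron and its output differ afterwards, which is the failure of the fixed-selection assumption described above.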

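Figure 1: 1D demonstration of how the selected neuron changes when a parameter gets updated. The left panel shows the linear neurons and the examples before the update; learning the new example pushes its selected line up. The right panel shows that the update interferes not only with the new example but also with a historical example that had selected the other line, whose selection changes after the update.

Figure 2: 2D schematic of how learning on a new example interferes with historical examples for clipped maxout.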

Therefore, in the worst case of linear maxout, the set of interfered examples equals the full set of historical examples. To make the interfered set strictly smaller than the full historical set, we introduce a few modifications to maxout in the next section.

2.2.2 Clipped Maxout and Conditional Rehearsal

We clip the linear output before maxout to a constant $c$ with $\min(\cdot, c)$. $c$ can be any fixed value because we can rely on the bias term to control the magnitude of the output. We use $c = 0$ for clipping in this paper. The clipped maxout is described with the following function:

$$f(x) = \min\left( x^\top W_{\cdot g(x)} + b_{g(x)},\ 0 \right) \tag{8}$$
$$g(x) = \arg\max_{j \in [1, k]} \min\left( x^\top W_{\cdot j} + b_j,\ 0 \right) \tag{9}$$

Conditioned on the new example, the set of historical examples can be divided into 3 disjoint sets based on which linear neurons are clipped at each example:

  1. Examples at which no linear neuron is clipped.

  2. Examples at which only the linear neuron selected by the new example is clipped.

  3. All remaining examples, which are clipped on at least one linear neuron other than the one selected by the new example.

We show a 2D graphical depiction of the 3 sets in Fig. 2, where they are colored pink, yellow and light blue. The dashed red line stands for the linear neuron selected by the new example; it will be updated when we perform one step of gradient descent on the new example. When the update happens, only examples falling in the first and second sets will be interfered. Examples in the third set will not be interfered because they are clipped on at least one neuron that is not updated. Refer to Sec. G of the Appendix for a formal proof. One good property of the clipped maxout is that the interfered examples fall within the convex set enclosed by the linear neurons excluding the neuron being updated. This convex set could potentially be small if enough neurons are maxed out.

Given that training on the new example only interferes with the first and second sets, we can utilize conditional rehearsal to specifically rehearse these examples when learning new knowledge. If the model has enough capacity to learn new knowledge and at the same time preserve the output for the rehearsed examples, it is guaranteed that there will be no forgetting of historical examples. The effectiveness of conditional rehearsal on clipped maxout units will be verified in the experiments section.
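The partition of historical examples can be sketched as follows. This is an illustrative sketch under stated assumptions: the clipped maxout is taken to be the maximum over linear outputs clipped from above at `c = 0`, a neuron counts as clipped when its linear output reaches `c`, and the function names, weights and examples are hypothetical.

```python
def clipped_outputs(x, W, b, c=0.0):
    """min(z_j, c) for every linear neuron of one unit."""
    return [min(sum(xi * wi for xi, wi in zip(x, w)) + bj, c)
            for w, bj in zip(W, b)]

def clipped_maxout(x, W, b, c=0.0):
    return max(clipped_outputs(x, W, b, c))

def partition(history, j_new, W, b, c=0.0):
    """Split historical examples by which neurons are clipped.

    j_new is the neuron selected by the new example, i.e. the one updated.
    """
    unclipped, only_selected, others = [], [], []
    for x in history:
        clipped = {j for j, z in enumerate(clipped_outputs(x, W, b, c)) if z >= c}
        if not clipped:
            unclipped.append(x)          # may be interfered
        elif clipped == {j_new}:
            only_selected.append(x)      # may be interfered
        else:
            others.append(x)             # protected by an untouched clipped neuron
    return unclipped, only_selected, others

# Two 1D neurons: z_0 = x - 1 and z_1 = -x - 1.
W, b = [[1.0], [-1.0]], [-1.0, -1.0]
history = [[0.5], [2.0], [-2.0]]
sets = partition(history, j_new=0, W=W, b=b)

# Arbitrary change to neuron 0 only, standing in for a gradient step.
W_new, b_new = [[0.3], [-1.0]], [-2.0, -1.0]
```

In this toy case the example `[-2.0]` is clipped on neuron 1, which is not updated, so its output stays at the clipping constant whatever happens to neuron 0, while the unclipped example `[0.5]` changes.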

2.2.3 Minimally Clipped Minout

To make the activation value positive rather than negative in the convex set where no neuron is clipped, we adopt the mirror negative of the maximally clipped maxout, which is the minimally clipped minout. The definition and the motivation of minimally clipped minout are detailed in Sec. I of the Appendix.

3 Experiments

Data — We experiment on the MNIST dataset in this work.

Setups — Disjoint MNIST Srivastava2013 and Permuted MNIST Goodfellow2013 ; Kirkpatrick2017 are the most commonly used settings. Disjoint MNIST splits the dataset into multiple subsets which have disjoint labels. Permuted MNIST creates new datasets from MNIST by permuting the pixels. For these two setups, the algorithm is trained on one subset at a time with i.i.d. assumption within each subset.

In this work, however, we study continual learning with an online non-stationary setting where a single example at a time is seen before making an update, and the distribution of the received example shifts over time. The goal is to fit optimally to the already seen examples at any time point of the training, which can be measured by the accuracy on the test set throughout the training procedure. Accordingly, we propose a new setup for the MNIST dataset named MNIST with ordered labels (MNIST-ol). As the name implies, the training images are arranged by their associated labels in an ascending (or descending) order. For example, with ascending order, images with label 0 are received first, and those with label 9 are received last. Ordering by labels removes the assumption that each example from the data stream is drawn i.i.d. from the whole training set. It can be seen from Fig. 7 in the Appendix that learning with stochastic gradient descent completely fails on MNIST-ol.

Due to the current suboptimal implementation, we experiment with a subset of MNIST-ol obtained by randomly taking 100 examples from each class.
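The ordering and subsampling can be sketched in a few lines. The helper `make_ordered_stream` below is hypothetical and only illustrates the data arrangement; it works on any iterable of (image, label) pairs and does not load MNIST itself.

```python
import random
from collections import defaultdict

def make_ordered_stream(examples, per_class=None, descending=False, seed=0):
    """Arrange (image, label) pairs by label, optionally subsampling each class."""
    by_label = defaultdict(list)
    for image, label in examples:
        by_label[label].append((image, label))
    rng = random.Random(seed)
    stream = []
    for label in sorted(by_label, reverse=descending):
        items = by_label[label]
        if per_class is not None:
            # Random subset of each class, without replacement.
            items = rng.sample(items, min(per_class, len(items)))
        stream.extend(items)
    return stream

# Toy stand-in for a dataset: integers as "images", three labels.
toy = [(i, i % 3) for i in range(12)]
stream = make_ordered_stream(toy, per_class=2)
labels = [label for _, label in stream]
```

Feeding `stream` to a learner one item at a time gives a non-i.i.d. sequence in which each class is seen in a single contiguous block.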

Model configuration — Our model is a single layer minout network with 10 minout units, each corresponding to one label. Each minout unit has 50 linear neurons. We apply a sigmoid activation function on the linear neurons before minout so that they are clipped to the range $(0, 1)$. The output of each minout unit is directly used as the probability of each label and trained by a per-label sigmoid cross entropy loss. An activation value below a small threshold is treated as clipped to 0 when deciding the interfered set.
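A forward pass of such a model might look like the sketch below. It is an assumption-laden illustration rather than the authors' implementation: the function names are hypothetical, weights are plain nested lists, and the small `eps` in the loss is added only for numerical safety.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def minout_unit(x, W, b):
    """Minimum over the sigmoid-activated linear neurons of one unit."""
    return min(sigmoid(sum(xi * wi for xi, wi in zip(x, w)) + bj)
               for w, bj in zip(W, b))

def forward(x, units):
    """units: list of (W, b) pairs, one per label. Returns per-label probabilities."""
    return [minout_unit(x, W, b) for W, b in units]

def per_label_loss(probs, label, eps=1e-12):
    """One binary cross entropy term per label (target 1 for the true label)."""
    return [-math.log(p + eps) if k == label else -math.log(1.0 - p + eps)
            for k, p in enumerate(probs)]

# Tiny stand-in: 10 units, 3 neurons each, 4 input features, all-zero weights.
units = [([[0.0] * 4 for _ in range(3)], [0.0] * 3) for _ in range(10)]
probs = forward([1.0, 2.0, 3.0, 4.0], units)
losses = per_label_loss(probs, label=7)
```

With the configuration described above, each unit would instead hold 50 neurons over the flattened image, and each entry of `losses` would be minimized for its own unit.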

Baselines — As baselines, we compare to the same model trained on MNIST-ol without rehearsal and with rehearsal on randomly selected historical examples. The number of randomly selected examples is set to match the number of conditionally rehearsed examples in the studied method, which is 100 according to Sec. 3.1.

Training — Training happens one example after another with an additional rehearsal loss and corresponding updates. For both this method and the baselines, the training of an example on one unit is stopped as soon as the loss on this unit falls below a preset threshold.

3.1 Number of Examples Rehearsed

Under the configuration of our model, each minout unit encloses one class of the training data in its convex set, which suggests that the theoretical number of rehearsed examples should be around 100. To verify this, we plot the average number of examples that are rehearsed for each minout unit during training in Fig. 3. It can be seen that the number of rehearsed examples throughout training fluctuates around 100, which is consistent with our expectations. More discussion on the number of rehearsed examples can be found in Appendix K.

3.2 Accuracy and Forgetting Behavior of the Proposed Method

We measure the accuracy on both the training set and the test set after learning every example, and plot the training/testing accuracy in Fig. 4. We can see that both training and testing accuracy monotonically increase throughout training for clipped minout with conditional rehearsal. Accuracy on the training set reaches 100% at the end of training, which means no forgetting is happening.

The no-rehearsal baseline fails to learn, as expected. However, the random rehearsal baseline performs as well as conditional rehearsal. We argue that this is because the MNIST dataset has only a few modes, so that 100 randomly selected examples contain enough information about the whole dataset. For a more complex dataset where the number of modes exceeds the number of rehearsed examples, conditional rehearsal would be advantageous because it is more selective and thus more efficient. We verify this by further reducing the training set to 10 examples from each class. Correspondingly, the number of rehearsal examples is reduced to 10. It is harder for 10 examples to contain sufficient information about the whole dataset. As shown in Fig. 5, conditional rehearsal outperforms random rehearsal by a big margin.

Figure 3: Number of rehearsed examples throughout training.

Figure 4: Accuracy of conditional rehearsal vs random rehearsal vs no rehearsal.

Figure 5: Accuracy of conditional rehearsal vs random rehearsal with a smaller training set.

4 Future Directions

We will focus on two directions in the future. First, we aim to develop a deep version of the proposed clipped maxout network. Second, we plan to design connectionist approaches for storing historical data.

References

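Appendix G One-step update on the new example does not interfere with examples in the third set

Here the third set consists of the historical examples that are clipped on at least one linear neuron other than the one selected by the new example $x_{\text{new}}$, and $z_j(x) = x^\top W_{\cdot j} + b_j$ denotes the output of linear neuron $j$.

Denote the parameters after updating on $x_{\text{new}}$ as $W'$ and $b'$, where $W'_{\cdot j} = W_{\cdot j}$ and $b'_j = b_j$ if $j \neq g(x_{\text{new}})$.

Theorem 2.

Let $f'$ denote the function after the update, then $f'(x) = f(x)$ for every $x$ in the third set.

Proof. By definition of the third set, for $x$ in it there exists $j \neq g(x_{\text{new}})$ where $z_j(x) \geq 0$.

For any $l \in [1, k]$,

$$\min(z_l(x), 0) \leq 0 = \min(z_j(x), 0)$$

Therefore, $f(x) = \max_{l} \min(z_l(x), 0) = 0$.

Let $z'_l$ denote the $z_l$ after the update,

$$z'_j(x) = z_j(x) \geq 0, \qquad \min(z'_l(x), 0) \leq 0 \ \text{ for any } l \in [1, k]$$

Therefore, $f'(x) = 0 = f(x)$, which completes the proof.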

Appendix H Bookkeeping for Conditional Rehearsal

To efficiently locate the interfered historical examples when training on a new example, we use a key-value store to keep the historical data. The keys are indices of linear neurons, and an example is stored under a key if the corresponding neuron is clipped at this example. At the same time, a counter is associated with each example to count how many linear neurons are clipped. For a historical example, if no neurons are clipped, it falls in the first set; if only the neuron selected by the new example is clipped, it falls in the second set; otherwise it falls in the third set. After the weight update, the information in the table is updated accordingly.
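One possible shape for this bookkeeping is sketched below. The class `ClipStore` and its method names are hypothetical; the sketch only assumes what the paragraph states, namely a store keyed by neuron index plus a per-example counter of clipped neurons.

```python
from collections import defaultdict

class ClipStore:
    """Tracks, for each historical example, which linear neurons are clipped."""

    def __init__(self):
        self.by_neuron = defaultdict(set)   # neuron index -> ids of examples clipped there
        self.clip_count = {}                # example id -> number of clipped neurons

    def record(self, example_id, clipped_neurons):
        """Insert an example, or refresh it after a weight update."""
        clipped = set(clipped_neurons)
        for ids in self.by_neuron.values():
            ids.discard(example_id)         # drop stale entries
        for j in clipped:
            self.by_neuron[j].add(example_id)
        self.clip_count[example_id] = len(clipped)

    def interfered(self, j_new):
        """Examples with no clipped neuron, or clipped only on neuron j_new."""
        unclipped = {e for e, n in self.clip_count.items() if n == 0}
        only_selected = {e for e in self.by_neuron.get(j_new, set())
                         if self.clip_count[e] == 1}
        return unclipped | only_selected

store = ClipStore()
store.record("a", [])        # no neuron clipped
store.record("b", [0])       # clipped only on neuron 0
store.record("c", [1])       # clipped only on neuron 1
store.record("d", [0, 2])    # clipped on neurons 0 and 2
rehearse = store.interfered(j_new=0)
```

The scan for unclipped examples is linear in this sketch; a production version would presumably keep that set incrementally as well.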

Appendix I Minimally Clipped Minout

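The minimally clipped minout unit is defined as follows,

$$f(x) = \max\left( x^\top W_{\cdot g(x)} + b_{g(x)},\ 0 \right) \tag{16}$$
$$g(x) = \arg\min_{j \in [1, k]} \max\left( x^\top W_{\cdot j} + b_j,\ 0 \right) \tag{17}$$

We can rewrite Eqn. 8 and Eqn. 9 as the negative of minout, with $-W$ and $-b$ as the parameters.

$$f(x) = -\max\left( -x^\top W_{\cdot g(x)} - b_{g(x)},\ 0 \right) \tag{18}$$
$$g(x) = \arg\min_{j \in [1, k]} \max\left( -x^\top W_{\cdot j} - b_j,\ 0 \right) \tag{19}$$

In this work we adopt minimally clipped minout instead of maximally clipped maxout because it is activated (larger than 0) rather than deactivated (smaller than 0) in the convex set where no neuron is clipped. This aligns better with human intuition. When we think of the maxout unit as a detector of some property of the input data, we would like the unit to be activated when the property is present.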
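The equivalence between clipped maxout and the negative of clipped minout can be checked term by term. The short derivation below writes $z_j$ for the output of linear neuron $j$ (a shorthand introduced here) and assumes the clipped maxout takes the form $\max_j \min(z_j, 0)$ with clipping constant 0.

```latex
\begin{aligned}
\min(z_j, 0) &= -\max(-z_j, 0) \\
\max_j \min(z_j, 0) &= \max_j \bigl( -\max(-z_j, 0) \bigr) \\
                    &= -\min_j \max(-z_j, 0)
\end{aligned}
```

The last line is a minimally clipped minout evaluated on the negated linear outputs $-z_j$, i.e. with negated weights and biases, followed by a sign flip.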

Appendix J Relationship between Minout and the Gating Mechanism for Conditional Computation

The original conditional computation paper and follow-up works introduce binary gating neurons to turn on/off computing neurons conditioned on inputs Serra2018 . In practice, sigmoid activation functions are usually used in place of the binarization for the ease of training. Assuming both the computing neuron and the gating neuron use sigmoid activation functions, it can be written as:

$$f(x) = \sigma\left( x^\top w_c + b_c \right) \cdot \sigma\left( x^\top w_g + b_g \right) \tag{20}$$

where the subscripts $c$ and $g$ mark the computing and gating neurons. Note that the computing neuron and gating neuron are indistinguishable and can be swapped. Writing $\sigma_j(x) = \sigma(x^\top W_{\cdot j} + b_j)$, we can generalize this into $f(x) = \prod_j \sigma_j(x)$. We can see that $f(x) \approx 0$ if there exists $j$ so that $\sigma_j(x) \approx 0$. And $f(x) \approx 1$ only when $\sigma_j(x) \approx 1$ for all $j$.

For clipped minout:

$$f(x) = \min_j \sigma_j(x) \tag{21}$$

It has a similar behavior: $f(x) \approx 1$ only when all of $\sigma_j(x) \approx 1$.

In fact, Eqn. 20 and 21 can be seen as AND functions, whose output is non-zero only when all the neurons are activated. The AND function is activated only when all conditions are satisfied. This means that if the minout function is properly trained, only very specific examples will fall in the convex set where the unit is activated, making the rehearsal set small.

Appendix K Number of Rehearsed Examples

One can consider the maxout unit as a detector of some property of the input. Inputs that have this property will fall within the convex set where no neuron is clipped; we call them positive examples. Inputs that do not have this property will fall outside of the convex set, and we call them negative examples.

For a single maxout unit, the number of rehearsed examples depends on the sizes of the first set (no neuron clipped) and the second set (only the neuron selected by the new example clipped). The second set can be small if most negative examples are clipped at more than one neuron. The size of the first set depends on how many historical examples activate this maxout unit. In this paper, since each maxout unit corresponds directly to one of the categories, the size of the first set is approximately one tenth (for ten categories in MNIST) of the total training examples.

In the future, a deep version of this idea could have more maxout units in the hidden layers, and we can study the relationship between the number of rehearsed examples and the number of maxout units.

Appendix L Stochastic Gradient Descent Fails on MNIST-ol

A 2-layer multilayer perceptron with ReLU activations and a softmax loss is constructed and trained with SGD on MNIST-ol. It is compared against online MNIST with the i.i.d. assumption, i.e. SGD with a mini-batch size of 1.

Figure 7: Training with Stochastic Gradient Descent fails on MNIST-ol.

Fig. 7 shows that SGD fails utterly on MNIST-ol, where the i.i.d. assumption is broken due to the ordering.