Superposition of many models into one

02/14/2019 · by Brian Cheung, et al.

We present a method for storing multiple models within a single set of parameters. Models can coexist in superposition and still be retrieved individually. In experiments with neural networks, we show that a surprisingly large number of models can be effectively stored within a single parameter instance. Furthermore, each of these models can undergo thousands of training steps without significantly interfering with other models within the superposition. This approach may be viewed as the online complement of compression: rather than reducing the size of a network after training, we make use of the unrealized capacity of a network during training.

1 Introduction

Connectionist models have enjoyed a resurgence of interest in the artificial intelligence community. In particular, neural network models have demonstrated remarkable performance on many tasks across the domains of vision, text and speech. But in practice, a separate model is dedicated to each of these tasks. Drawing inspiration from another classic connectionist model, the associative memory, we develop a framework for combining multiple distinct models into one superposition of models. By utilizing the dormant capacity already present in neural networks, it is possible to reliably store and retrieve models that are dynamically changing due to learning.

We propose a modification to a fundamental operation performed across many representation learning models: the linear transformation. With a simple change to this nearly ubiquitous operation, we convert the linear transformation itself into a memory. Stored within this memory are multiple linear transformations in a state of superposition. Context information is used to recall an individual linear transformation from this superposition. An overview of this process is shown in Figure 1.

Instead of trying to share a single model across multiple tasks, individual models for each task can coexist with one another in superposition. Context information dynamically ‘routes’ an input towards a specific model within this superposition. The ability to route inputs to models during training opens up many possibilities in online learning and learning in memory constrained environments.

The goal of this paper is to introduce a general framework for parameter superposition and to empirically demonstrate both its features and limitations. By establishing a foundation for this new framework, we embrace extensions into these domains and beyond.

Figure 1: Models $W_1$, $W_2$ and $W_3$ are superimposed in a single set of parameters $W$. The context $c(2)$ is a vector (e.g. of random $\pm 1$ values) which implicitly retrieves model $W_2$ via $W\big(c(2) \odot x\big)$.

2 Background

An associative memory is a form of content-addressable memory where recall is based on similarity to the content of stored patterns (Linster). Unlike traditional computer memory, which is accessed by an explicit addressing scheme where the address must be exact, associative memories are robust to corruptions during recall. This resilience makes them particularly useful in representation learning, where training data is imprecise and the training process itself is stochastic.

Associative memories are categorized by the way in which memories are retrieved. An auto-associative memory is a memory where an approximate (or partial) memory is used to recover the complete memory of itself, with the Hopfield Network (Hopfield, 1982) being a notable example. In contrast, a hetero-associative memory uses content which is different from the memory recovered by that content. From a data structures perspective, this is similar in spirit to a hash-map where a key can be used to retrieve a corresponding value. In this work, we refer to this key as context information.

A holographic reduced representation (HRR) is a hetero-associative memory proposed by Plate (1995) for storing compositional structures. In contrast to Hopfield networks, which store individual items, HRRs store pairs of items in superposition. The act of forming these pairs is called binding. One component of the pair can be retrieved using the other component as a cue. For example, $x$ can be used to restore $y$ from the memory $c = x \circledast y$:

$$y \approx x^{-1} \circledast c,$$

where $\circledast$ is a binding operator. This is akin to a bidirectional hash-map where a key can be used to retrieve a corresponding value and vice versa (i.e. reverse lookup). Due to the superposition, however, noise is introduced by this retrieval process, and so a 'clean-up' process is necessary to restore the retrieved memory to its original state. Plate (1995) suggests using a separate auto-associative memory for clean-up.
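To make the binding-and-retrieval idea concrete, the following NumPy sketch stores two key-value pairs in one memory using circular convolution as the binding operator and retrieves one value with its key; the helper names (`bind`, `approx_inverse`), the dimensionality and the Gaussian initialization are illustrative assumptions rather than code from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1024

def bind(a, b):
    # circular convolution, computed in the Fourier domain
    return np.real(np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)))

def approx_inverse(a):
    # involution (index reversal); acts as an approximate inverse for retrieval
    return np.concatenate(([a[0]], a[:0:-1]))

keys = rng.normal(0, 1 / np.sqrt(n), (2, n))     # context / key vectors
values = rng.normal(0, 1 / np.sqrt(n), (2, n))   # stored items

# store both pairs in superposition
memory = bind(keys[0], values[0]) + bind(keys[1], values[1])

# retrieve the first value using its key; the result is a noisy version of values[0]
noisy = bind(approx_inverse(keys[0]), memory)

cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
print(cos(noisy, values[0]), cos(noisy, values[1]))  # clearly higher similarity to the correct item
```

The retrieved vector is only approximately equal to the stored one, which is exactly why a clean-up memory is needed in the HRR setting.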

3 Motivation

Kanerva (2009) proposed a hyperdimensional computing framework which utilizes simple arithmetic operations to manipulate hetero-associative memories like HRRs. Storage, addressing and erasure are accomplished by addition, multiplication and subtraction operations respectively. Drawing inspiration from this idea, we develop parameter superposition, a memory framework which stores the parameters of learning systems as memories. We relax the requirement of a clean-up process which greatly simplifies our framework. To justify this relaxation, we posit that the parameters of these learning systems are robust to the noise caused by retrieval.

Robustness to noise requires some amount of redundancy. Signs of redundancy can manifest themselves in opportunities for compression. Previous work has shown that the parameters of a neural network can be drastically compressed after training (Wan et al., 2013; Han et al., 2015).

Han et al. (2015) show a majority of trained parameters in networks can be pruned with a simple magnitude-based thresholding procedure. More surprisingly, Frankle & Carbin (2018); Liu et al. (2018) show the pruning mask acquired after training can be applied to the same network before training with little to no adverse impact on learning. Li et al. (2018) found the intrinsic dimensionality of most tasks to be a small fraction of the dimensionality of the parameters. Such results imply that parameters are not being used efficiently in neural networks. This leads us to ask, can we take advantage of any redundancies during training?

Through experiments on multiple datasets, tasks and types of neural networks, we show parameter superposition can indeed exploit the excess capacity present in these models. Furthermore, our framework can exploit this capacity during training.

3.1 Online Learning

Training typically requires a large dataset of labelled examples, from which small batches of data are sampled to evaluate the loss function and consequently update the neural network parameters. For successful training, it has been found critical to construct batches by sampling data uniformly at random. Such random sampling ensures that the training data is independent and identically distributed.

When data arrives sequentially and its distribution shifts over time, this i.i.d. assumption no longer holds. Past attempts to overcome this issue have employed various mechanisms: the use of replay buffers in reinforcement learning (Mnih et al., 2013), separate networks for separate tasks (Rusu et al., 2016; Terekhov et al., 2015), or heuristics to selectively identify which weights can be changed during training (Kirkpatrick et al., 2017; Zenke et al., 2017). The core issue is to come up with a formulation that makes it possible to remember the past while maintaining the ability to learn from new data.

4 Parameter Superposition

Figure 2: (A) Parameter vectors $w_1, w_2, w_3$ have high similarity. (B) (Store) Binding with the context vectors $c_1, c_2, c_3$ rotates each $w_k$, making them nearly orthogonal to one another. (C) (Retrieve) Binding with $c_2^{-1}$ restores $w_2$, but the other $w_k$ remain nearly orthogonal, reducing interference during learning.

With inspiration from the superposition principle of linear systems, we propose a method called Parameter Superposition (PSP) to store many models simultaneously into one set of parameters. We augment the standard linear transformation with a binding operation:

$$y = W\big(c(k) \odot x\big) \qquad (1)$$

Context information $k$ generates a context vector $c(k)$. The matrix $W$ represents the parameters of the linear transformation. The symbol $\odot$ refers to element-wise multiplication. By multiplying the input vector $x$ with a context vector $c(k)$, $x$ is 'rotated' towards a particular model stored within $W$.

Since multiplication is associative, we can instead view this as a rotation of a particular model stored in the rows of $W$ towards $x$. This rotation is illustrated in the retrieval process shown in Figure 2. An overview of ways to generate these rotations is provided in Section 4.2.

Conceptually, one can think of the parameters $W$ as a superposition of $K$ parameter sets $W_k$, where $k$ is the dimension of superposition:

$$W = \sum_{k=1}^{K} W_k \,\mathrm{diag}\big(c(k)\big)^{-1} \qquad (2)$$

This is similar to the superposition principle in Fourier analysis, where a signal is represented as a superposition of sinusoids. Each sinusoid can be considered as a context vector. For the inverse Fourier transform, the dimension of superposition is the frequency of those sinusoids:

$$x(t) = \frac{1}{2\pi}\int_{-\infty}^{\infty} X(\omega)\, e^{i\omega t}\, d\omega$$

By using the complement (inverse) of a particular Fourier basis (context vector), we can retrieve a stored amplitude value. In our framework, when the set of contexts is discrete and finite, an element $k$ of that set can be used to recover an individual $W_k$ from the superposition $W$ by generating the corresponding context vector $c(k)$.

Properties of this recovery process are more clearly illustrated by substituting Equation 2 into Equation 1. Unpacking the inner product, Equation 1 can be rewritten as the sum of two terms, written concisely in matrix notation as:

$$y = W_s\, x \;+\; \underbrace{\sum_{k \neq s} W_k\, \mathrm{diag}\big(c(k)\big)^{-1} \mathrm{diag}\big(c(s)\big)\, x}_{\epsilon} \qquad (3)$$

The first term, $W_s x$, is the recovered linear transformation and the second term, $\epsilon$, is a residual. For particular formulations of the set of context vectors, $\epsilon$ is a summation of terms which interfere destructively.
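As a minimal sketch of Equations 1-3 (not the paper's code), the snippet below superimposes several random 'models' with binary context vectors and retrieves one of them; the unit-norm template rows and the template-like input are stand-ins for trained weights and data:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, K = 4096, 16, 10                 # input dim, output dim, number of stored models

# K individual "models": banks of unit-norm rows (a stand-in for trained weights)
W_k = rng.normal(size=(K, M, N))
W_k /= np.linalg.norm(W_k, axis=2, keepdims=True)

c = rng.choice([-1.0, 1.0], size=(K, N))      # one random binary context per model (c^{-1} = c)
W = np.einsum('kmn,kn->mn', W_k, c)           # store all models in one matrix (Equation 2)

s = 3
x = W_k[s, 0] + 0.1 * rng.normal(size=N) / np.sqrt(N)   # an input that model s responds to strongly
y_psp = W @ (c[s] * x)                        # Equation 1: retrieve model s from the superposition
y_true = W_k[s] @ x                           # model s applied on its own

cos = y_psp @ y_true / (np.linalg.norm(y_psp) * np.linalg.norm(y_true))
print(round(cos, 3))                          # close to 1 when N >> K: the residual is small
```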

4.1 Analysis of residual

We establish properties of the destructive interference which make it possible to recover a linear transformation from the superposition.

Proposition 1.

The residual $\epsilon$ is unbiased in expectation: $\mathbb{E}[\epsilon] = 0$.

Proposition 1 states that, in expectation, other models within the superposition will not introduce a bias to the recovered linear transformation.

Proposition 2.

For binary and complex context vectors, when we bind a random context $c$ with $w$, $\operatorname{Var}\big[(c \odot w)^{\top} x\big] \approx \tfrac{1}{N}\lVert w \rVert^{2}\lVert x \rVert^{2}$ under mild conditions. For the rotational case, let $C$ be a random orthogonal matrix s.t. $Cw$ has a random direction. Then $\operatorname{Var}\big[(Cw)^{\top} x\big] = \tfrac{1}{N}\lVert w \rVert^{2}\lVert x \rVert^{2}$. In both cases, let $s_k = w_k^{\top} x$ denote the signal of model $k$. If $N$ is large, then the fluctuations of the residual will be relatively small compared to $s_k$.

If we assume that the products $\lVert w_k \rVert \lVert x \rVert$ are all equally large and denote this common magnitude by $a$, then the standard deviation of each interference term is roughly $a / \sqrt{N}$. When this is small, the residual introduced by other superimposed models will stay small. Binding with the random keys roughly attenuates each model's interference by a factor proportional to $1/\sqrt{N}$.
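The scaling in Propositions 1 and 2 can be checked numerically; this is a small Monte Carlo sketch with arbitrary fixed vectors, not an experiment from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
N, trials = 1024, 5000
w = rng.normal(size=N)    # a fixed weight vector
x = rng.normal(size=N)    # a fixed input

# interference term w^T (c ⊙ x) for many random binary keys c
samples = np.array([(rng.choice([-1.0, 1.0], size=N) * w) @ x for _ in range(trials)])

print(samples.mean())     # ≈ 0 relative to the spread below (Proposition 1)
print(samples.std())      # ≈ ||w|| ||x|| / sqrt(N)  (Proposition 2)
print(np.linalg.norm(w) * np.linalg.norm(x) / np.sqrt(N))   # reference value
```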

Propositions 1 and 2 show that we can superimpose individual models after training and the interference should stay small. In the next proposition, we consider keeping individual models in superposition during training.

Proposition 3.

Denote the cost function of the network with PSP used in context $k$ as $L^{psp}_k(W)$, and the cost function of the corresponding network without superposition as $L(W_k)$.

  1. For the complex context vector case, $\nabla_{W} L^{psp}_k \approx c(k) \odot \nabla_{W_k} L$, where $c(k)$ is the context vector used in $L^{psp}_k$ and $W_k$ are the weights of the network.

  2. For the general rotation case in real vector space, $\nabla_{W} L^{psp}_k \approx C_k\, \nabla_{W_k} L$, where $C_k$ is the context matrix used in $L^{psp}_k$ and $W_k$ are the weights of the network.

Proposition 3 shows that the parameter updates of an individual model in superposition are approximately equal to the updates of that model trained outside of superposition. The gradient of parameter superposition creates a superposition of gradients with destructive-interference properties analogous to Equation 1. Therefore, memory operations in parameter superposition can be applied in an online fashion. Proofs of the propositions above are provided in the appendix.

In the following sections, we describe multiple formulations of context vectors in the superposition framework which promote this destructive interference.

4.2 A more general framework

By replacing the element-wise multiplication with a matrix multiplication, Equation 1 can be generalized:

$$y = W\, C(k)\, x \qquad (4)$$

where $C(k) \in \mathbb{R}^{N \times N}$ is an orthogonal matrix. The left action of $C(k)$ on $x$ can also be viewed as a right action of $C(k)$ on the rows of $W$. From this perspective, parameter superposition rotates each row of parameters with a unitary transformation $C(k)$.

$C(k)$ is an operator which rotates $x$ into a different region of feature space. As the context information $k$ generates $C(k)$, $k$ directs where $x$ is rotated. Through this rotation, $k$ controls the degree of overlap between the different $W_k$ in superposition. In turn, this controls the amount of interference during learning between the different $W_k$. Figure 2 is a visualization of this binding process. In the following sections, we review various ways to generate these unitary transformations.

4.2.1 Complex Superposition

In Equation 1, we described complex superposition, a form of parameter superposition where complex context vectors are used to efficiently generate unitary transformations:

$$c(k) = \big[e^{i\phi_1(k)},\, e^{i\phi_2(k)},\, \ldots,\, e^{i\phi_N(k)}\big]^{\top} \qquad (5)$$

where each component $e^{i\phi_j(k)}$ lies on the complex unit circle. The phase $\phi_j(k)$ for all $j$ is sampled with uniform probability density on $(-\pi, \pi]$.

This form of parameter superposition is particularly relevant for the predominant form of neural network models. The differentiability with respect to phase enables gradient-based learning through the context. Moreover, the size of the unitary transformation scales linearly with the dimensionality of the input vector.
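A short NumPy sketch of how such a complex context vector could be generated and inverted (the dimension and seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 256
phi = rng.uniform(-np.pi, np.pi, size=N)   # phases sampled with uniform density
c = np.exp(1j * phi)                       # context vector; every entry lies on the unit circle

x = rng.normal(size=N) + 1j * rng.normal(size=N)
bound = c * x                              # bind the input with the context
recovered = np.conj(c) * bound             # unbind with the conjugate (inverse) context
print(np.allclose(recovered, x))           # True: the binding is exactly invertible
```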

4.2.2 Binary Superposition

Constraining the phase values to $\phi_j(k) \in \{0, \pi\}$ is a special case of complex superposition. The context vectors then become binary, $c(k) \in \{-1, 1\}^N$. We refer to this formulation as binary superposition. The low precision of the context vectors in this form of superposition has both computational and memory advantages. Furthermore, binary superposition is directly compatible with both real-valued and low-precision linear transformations.

4.2.3 Rotational Superposition

Binary and complex superposition are special subgroups of the larger group of unitary matrices (i.e. the unitary group). For completeness, the superposition principle can also be extended to the broader class of orthogonal context matrices $C(k) \in O(N)$. This class includes rotations which are not possible with diagonal matrices (e.g. permutations). We refer to this formulation as rotational superposition. The rotations are sampled uniformly from the orthogonal group (Haar distribution); in practice we use scipy.stats.ortho_group.
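For instance, a rotational context can be drawn and applied as follows (a sketch; only the use of scipy.stats.ortho_group is taken from the text, the sizes are arbitrary):

```python
import numpy as np
from scipy.stats import ortho_group

N = 64
C = ortho_group.rvs(dim=N, random_state=0)   # Haar-distributed random orthogonal matrix

x = np.random.default_rng(0).normal(size=N)
print(np.allclose(C @ C.T, np.eye(N)))       # True: C is orthogonal
print(np.allclose(C.T @ (C @ x), x))         # True: binding with C is undone exactly by C^T
```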

Figure 3: The topology of the context operators acting on a parameter vector $w$, e.g. $c(k) \odot w$. (A) Binary contexts operate on a lattice, (B) complex contexts operate on a torus, (C) rotational contexts operate on a sphere.

For each type of superposition, Figure 3 provides the geometry of the rotations which can be applied to the parameters $w$. This illustrates the topology of the embedding space of superimposed models. The choice of how to parameterize the unitary transformation in superposition depends on the specific application. For example, a continuous topology enables gradient-based learning of the context operator. Without loss of generality, we describe the ideas in the following sections from the perspective of complex superposition.

5 From superposition to composition

While a context is an operator on parameter vectors $w$, the context itself can also be operated on. Analogous to the notion of a group in abstract algebra, new contexts can be constructed from a composition of existing contexts under a defined operation. For example, the context vectors in complex superposition form a Lie group under element-wise complex multiplication. This enables parameters to be stored and recovered from a composition of contexts:

$$c(k_1, k_2) = c(k_1) \odot c(k_2) \qquad (6)$$

By creating functions $c(\cdot)$ over the superposition dimension $k$, we can generate new context vectors in a variety of ways. To introduce this idea, we describe two basic compositions.

5.1 Mixture of contexts

The continuity of the phase in complex superposition makes it possible to create mixtures of contexts that generate a smoother transition from one context to the next. One basic mixture is an average window over the previous, current and next context:

$$c_{mix}(k) = \tfrac{1}{3}\big(c(k-1) + c(k) + c(k+1)\big) \qquad (7)$$

The smooth transition reduces the orthogonality between neighboring context vectors. Parameters with neighboring contexts can 'share' information during learning, which is useful in transfer-learning settings and in continual learning settings where the domain shift is smooth.
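A small NumPy sketch of such a mixture, using the three-context average written above (the task count and dimensionality are arbitrary, and the exact mixture used in the experiments may differ):

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 1024, 20
c = np.exp(1j * rng.uniform(-np.pi, np.pi, size=(K, N)))   # one complex context per task

def mixed(k):
    # Equation 7: average of the previous, current and next context
    return (c[k - 1] + c[k] + c[k + 1]) / 3

cos = lambda a, b: abs(np.vdot(a, b)) / (np.linalg.norm(a) * np.linalg.norm(b))
print(round(cos(mixed(5), mixed(6)), 2))   # neighbouring mixtures overlap substantially (about 2/3)
print(round(cos(c[5], c[12]), 2))          # unrelated raw contexts are nearly orthogonal (~1/sqrt(N))
```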

5.2 Powers of a single context

A particularly memory-efficient way of generating new contexts is raising a given context to a power (exponentiation). For complex superposition, a single context vector $c$ can be used to create new contexts by taking element-wise powers:

$$c(k) = c^{k} = \big[e^{i k\phi_1},\, \ldots,\, e^{i k\phi_N}\big]^{\top} \qquad (8)$$

Superposition with powers has the advantage of a constant-sized memory footprint even as new models are added in superposition. This enables efficient communication of context changes with a single value without the need to store a set of unique contexts.
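A sketch of powers of a single complex context in NumPy (dimension arbitrary), showing that distinct powers stay on the unit circle yet remain nearly orthogonal to each other:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 512
phi = rng.uniform(-np.pi, np.pi, size=N)
c = np.exp(1j * phi)                    # the single stored context vector

def context(k):
    return c ** k                       # Equation 8: the context for model k, generated on the fly

a, b = context(3), context(7)
print(abs(np.vdot(a, b)) / N)           # ≈ 1/sqrt(N): distinct powers are nearly orthogonal
print(np.allclose(np.abs(a), 1.0))      # True: every entry stays on the unit circle
```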

In complex superposition, context vectors exist on the topology of a complex $N$-dimensional torus. Many other functions $c(k)$ can be defined over this smooth manifold. When these functions are differentiable, gradient-based optimization (e.g. backpropagation) can be used to learn $c(k)$ from data. In this work, we focus on first introducing basic operations over context vectors and leave such extensions for future work.

Model | # parameters ($K$ models) | +1 model
No Superposition | $K\,MN$ | $MN$
Binary | $MN + KN$ | $N$
Complex | $MN + KN$ | $N$
Rotational | $MN + KN^{2}$ | $N^{2}$
OnePower (Complex) | $MN + N$ | $0$
Table 1: Parameter count for superposition of $K$ models of a linear transformation of size $M \times N$. '+1 model' refers to the number of additional parameters required to add a new model.

6 Neural Network Superposition

We outlined multiple formulations of parameter superposition which involve a simple modification of the standard linear transformation. This transformation is a fundamental operation utilized by most neural network models. We can extend these formulations to entire neural network models by applying superposition (Equation 1) to the linear transformation of every layer $l$ of a neural network:

$$x^{(l+1)} = g\Big(W^{(l)}\big(c^{(l)}(k) \odot x^{(l)}\big)\Big) \qquad (9)$$

where $g$ is a non-linear (activation) function.
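As one possible realization of Equation 9, here is a PyTorch sketch of a two-hidden-layer network with binary superposition at every layer; the class names (`PSPLinear`, `PSPNet`) and layer sizes are our own illustrative choices, not the paper's implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PSPLinear(nn.Module):
    """Linear layer whose input is bound with a fixed ±1 context selected by a task index."""
    def __init__(self, n_in, n_out, n_tasks):
        super().__init__()
        self.linear = nn.Linear(n_in, n_out)
        ctx = torch.randint(0, 2, (n_tasks, n_in)).float() * 2 - 1
        self.register_buffer("ctx", ctx)            # contexts are fixed, not trained

    def forward(self, x, task):
        return self.linear(self.ctx[task] * x)      # Equation 1 applied at this layer

class PSPNet(nn.Module):
    """Two-hidden-layer MLP with superposition at every layer (Equation 9)."""
    def __init__(self, n_tasks, n_in=784, n_hidden=256, n_out=10):
        super().__init__()
        self.l1 = PSPLinear(n_in, n_hidden, n_tasks)
        self.l2 = PSPLinear(n_hidden, n_hidden, n_tasks)
        self.l3 = PSPLinear(n_hidden, n_out, n_tasks)

    def forward(self, x, task):
        x = F.relu(self.l1(x, task))
        x = F.relu(self.l2(x, task))
        return self.l3(x, task)

# usage sketch: logits = PSPNet(n_tasks=50)(torch.randn(32, 784), task=3)
```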

6.1 Convolutional Networks

For neural networks applied to vision tasks, convolution is currently the dominant operation in a majority of layers. Since the dimensionality of the convolution parameters is usually much smaller than that of the input image, it is computationally cheaper to apply the context to the weights rather than to the input. By the associativity of multiplication, we are able to reduce computation by applying a context tensor $c(k)$ to the convolution kernel $w$ instead of the input image $x$:

$$y = \big(c(k) \odot w\big) \ast x \qquad (10)$$

where $\ast$ is the convolution operator.
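A minimal PyTorch sketch of Equation 10, binding a per-task ±1 context tensor to the kernel before convolving; the shapes and variable names are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n_tasks, out_ch, in_ch, ksize = 5, 16, 3, 3
weight = 0.1 * torch.randn(out_ch, in_ch, ksize, ksize)          # shared convolution kernel
contexts = torch.randint(0, 2, (n_tasks, out_ch, in_ch, ksize, ksize)).float() * 2 - 1

x = torch.randn(8, in_ch, 32, 32)          # a batch of images
task = 2
y = F.conv2d(x, contexts[task] * weight, padding=1)   # bind the kernel with c(k), then convolve
print(y.shape)                             # torch.Size([8, 16, 32, 32])
```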

7 Experiments

While superposition is a general memory framework for storing parameters and has many potential applications, our goal in this paper is to demonstrate the capacity and robustness of recovering models from the parameter superposition despite thousands of updates to these parameters.

Learning interference occurs when the distribution of training data shifts during training. The problem is so acute that continual learning literature often refers to it as catastrophic interference or forgetting. A network trained on multiple consecutive tasks will suddenly ‘forget’ or perform poorly on earlier tasks. This can be considered a form of overfitting where the model temporally overfits to the data it is currently presented, generalizing poorly to data at other timepoints.

We present experiments on multiple types of non-stationary data to illustrate the ability of the superposition framework to support online learning. Table 1 lists the number of parameters required by each formulation of superposition as a reference for their parameter cost.

7.1 Input Interference

A common scenario in online learning is when the input data distribution changes over time (e.g. as visual properties change from day to night). The permuting MNIST dataset (Goodfellow et al., 2013) is a variant of the MNIST dataset (LeCun et al., 1998) where the pixels of the input are randomly permuted during online learning. Each new permutation is considered a new task. Since the labels remain the same for each task, there is no change in the output distribution. Though permutation is an unrealistic distribution shift, we believe training on this dataset provides a good demonstration of the capacity of a model during online learning.
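One possible way to construct these tasks (a sketch; `mnist_train_images` below is a placeholder for an actual array of flattened MNIST images):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_permuted_tasks(images, n_tasks):
    """images: float array of shape (n_examples, 784). Returns one permuted copy per task;
    the labels are left unchanged, so only the input distribution shifts."""
    perms = [rng.permutation(images.shape[1]) for _ in range(n_tasks)]
    return [images[:, p] for p in perms]

# usage sketch: tasks = make_permuted_tasks(mnist_train_images, n_tasks=50)
```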

Figure 4: Binary superposition using networks with differing number of units (128 to 2048). Each line is permuting MNIST task 1 accuracy as a function of training on 50 tasks total.

7.1.1 Network size

In previous work, relatively large networks (2000 units) are trained on 10 tasks (10 consecutive permutations) (Zenke et al., 2017). We would like to investigate the impact of the dimensionality of the linear transformation on the superposition framework. To amplify these differences, we train networks of two hidden layers on 50 tasks (1000 steps per task) and vary the number of units (128 to 2048). A new context vector is generated when transitioning to a new task. We use binary superposition (pspBinary) to minimize the differences from the baseline architecture (standard).

In Figure 4, we see consistent improvements as the dimensionality of the feature space grows for parameter superposition but not a standard network of the same architecture. This shows that the superposition framework is making better use of the additional capacity present in the larger models.

7.1.2 Parameter efficiency

Figure 5: Histogram of weight magnitudes for each layer after training on permuting MNIST (50 tasks) using binary superposition (orange) and no superposition (blue).

If parameter superposition better utilizes the training capacity of a neural network, we expect to see a corresponding increase in parameter utilization after training. Following previous work on network compression (Han et al., 2015), we compare how prunable the weights of a network are after training on permuting MNIST (50 tasks). In Figure 5, we see the presence of higher-magnitude weights in a network with superposition than without. This suggests that superposition makes the network less compressible with magnitude-based thresholding (pruning). The trend is more pronounced the further downstream the layer. We conjecture that the pixel permutation in permuting MNIST is itself a rotation operation, making it behave similarly to a context rotation in superposition at the early layers.

Figure 6: Permuting MNIST task 1 accuracy on test set as a function of training step. At 1 task per 1000 steps, each network has seen 50 tasks in sequence at the end of training.

7.1.3 Types of superposition

To compare the different formulations of parameter superposition, we train two hidden layers with 256 units on 50 tasks (1000 steps per task). In Figure 6, we see that rotational superposition (pspRotation) performs consistently better than other variations of superposition. As rotational superposition is the broadest class of unitary transformations among those tested, it should be expected to make the most use of the available capacity. Furthermore, rotational introduces significantly more parameters than other superposition methods with each additional context (Table 1). Surprisingly, using powers of a single context (pspOnepower) performs at the same level if not better than using independent context vectors. Binary superposition (pspBinary) does not perform as well as others likely because it is the most constrained unitary transform among those tested.

Superposition does not accumulate learning constraints like most continual learning methods (Kirkpatrick et al., 2017). To verify that previously stored models do not hinder learning subsequent tasks, we compute the average accuracy across all tasks after training. To compare to previous work, we limit training to 10 tasks and match network architectures. Results are shown in Table 2.

Avg. Accuracy (%)
(Kirkpatrick et al., 2017)
(Zenke et al., 2017)
No Superposition
Binary
Complex
OnePower (Complex)
Table 2: Average accuracy of a two-hidden-layer network with 2000 units per layer across 10 permuting MNIST tasks at the end of training. Baseline rows are taken from Figure 4 in Zenke et al. (2017).

7.2 Output Interference

Output interference occurs when there is also a shift in the output (e.g. label) distribution of the training data. For example, this occurs when transitioning from one classification task to another. The incremental CIFAR (iCIFAR) dataset (Rebuffi et al., 2017; Zenke et al., 2017) is a variant of the CIFAR dataset (Krizhevsky & Hinton, 2009) where the first task is the standard CIFAR-10 dataset and subsequent tasks are formed by taking disjoint subsets of 10 classes from the CIFAR-100 dataset.

Even for continual learning methods, the outputs of the last layer are normally modularized, with separate parameters for each task (i.e. a multi-head network), to prevent interference due to the large output shift (Zenke et al., 2017). To demonstrate the robustness of our approach, we learn on iCIFAR with a single output layer shared across tasks via superposition, avoiding the need for a network with multiple output heads.

Figure 7: iCIFAR task 1 (CIFAR-10) accuracy on test set as a function of training step. All networks use a single output layer across all 5 tasks.

We train 6-layer convolutional networks using the superposition formulation in Equation 10 for each convolution layer and Equation 9 for every fully-connected layer. In Figure 7, we see a surprisingly small degradation in performance despite the drastic shift in the output distribution. This demonstrates the ability to implement modular neural networks without instantiating new parameters for new modules.

7.3 Continuous Domain Shift

Most methods, including those proposed for continual learning, are formulated for settings where the distribution shift is discrete (e.g. permuting MNIST, iCIFAR). But this may be a poor reflection of the distribution shift which occurs naturally in online data gathered from the real world. For example, day gradually becomes night and summer gradually becomes winter.

Figure 8: Samples of rotating-MNIST (top) and rotating-FashionMNIST (bottom) datasets. At each training step, the training distribution (green box) shifts by counterclockwise rotation.

To simulate a continuous domain shift, we propose rotating-MNIST and rotating-FashionMNIST, which are variants of the original MNIST and FashionMNIST (Xiao et al., 2017) datasets. At each step of training, a two-dimensional rotation is applied to the input images of the dataset. After a sufficient number of steps (1000 in our experiments), one revolution is completed, the input distribution returns to the starting distribution, and another cycle begins. Figure 8 shows examples from these two rotating datasets as a function of time.
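A sketch of how such a stream could be generated with scipy.ndimage.rotate (`mnist_images` is a placeholder array of 28x28 images; the 1000-step cycle matches the setting used in the experiments below):

```python
import numpy as np
from scipy.ndimage import rotate

def rotating_batch(images, step, steps_per_cycle=1000):
    """Rotate each 28x28 image by the angle corresponding to the current training step,
    completing one full revolution every `steps_per_cycle` steps."""
    angle = 360.0 * (step % steps_per_cycle) / steps_per_cycle
    return np.stack([rotate(im, angle, reshape=False, order=1) for im in images])

# usage sketch: batch = rotating_batch(mnist_images[:64], step=250)   # quarter-turn at step 250
```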

Figure 9: rotating-MNIST (top) and rotating-FashionMNIST (bottom) test accuracy at angle 0° as a function of training step. One full rotation occurs every 1000 steps.

First, we compare the performance of models utilizing context vectors similar to previous experiments. After every 100 steps of training, we transition to the next context. Therefore, the context transition speed is 10 transitions per cycle where a cycle is one full revolution (1000 steps) of the input dataset. In Figure 9, the oscillations in model accuracy are significantly reduced using superposition for both rotating-MNIST and rotating-FashionMNIST.

Figure 10: Closer comparison of each form of parameter superposition on the rotating-FashionMNIST task at angle 0°.

In Figure 10, we compare the types of superposition on the rotating-FashionMNIST dataset (top row). pspRotation has the highest peak performance but also shows the largest performance drops when other contexts are trained. Again, this is likely because this form of superposition has by far the most parameters. Similar to the experiments on permuting MNIST, pspComplex and pspOnePower show the most stable behavior while pspBinary performs slightly worse.

Figure 11: Comparison of different context selection functions on the rotating-FashionMNIST task at angle 0°.

7.3.1 Context functions

For continuous domain shift, the generation of context information becomes more important to avoid abrupt changes in model performance during learning. We compare various methods for shifting between contexts.

pspFast refers to a context transition speed of 1000 transitions per cycle. This is effectively storing 1000 models in superposition and we can see the learning performance has noticeably deteriorated. If we incorporate more prior knowledge into these 1000 contexts by taking a mixture of three contexts (pspFastLocalMix) described in Equation 7, we notice an improvement. By further incorporating a slower rate of context changes to 10 transitions per cycle (pspLocalMix), we notice a slight improvement over pspComplex which uses the same rate of context changes.

8 Related Work

As an online memory for parameters, superposition reduces interference in a fundamentally different way than previous methods. It does not explicitly constrain the learning of any model within the superposition. Most methods in continual learning utilize a constraining loss function (Kirkpatrick et al., 2017). Others freeze and grow parts of the network (Terekhov et al., 2015; Rusu et al., 2016; Zenke et al., 2017) to actively preserve previous models during training. As a consequence, the ability to learn new tasks becomes more and more limited as these loss constraints and computational costs accumulate. By acting as a memory, superposition provides control over how parameter capacity is allocated at any given moment during learning.

Therefore, like any other memory framework (Graves et al., 2014; Weston et al., 2014), parameter superposition requires a controller in the form of a context function to store and recall memories (i.e. parameters). Sukhbaatar et al. (2015) developed a differentiable form of the slot-based memory presented in Weston et al. (2014) allowing the controller to be learned without direct supervision. In multiple forms of superposition we describe, the context selection function is also differentiable and learnable via backpropagation.

9 Discussion

By making use of redundancies already present in neural network models, we introduce a memory framework which operates on their parameters. Entire neural networks become memories which can be stored or recalled with context information that is far smaller than the network itself. When using powers of a context vector, that information can be as small as a single scalar value. Furthermore, model parameters can be retrieved in a discrete or continuous fashion, which accommodates different types of distribution shift during training.

The flexibility of this approach in addressing learning interference opens the door to many new online applications. This is particularly useful in domains where memory resources are low both during training and at inference time. On the flip side, much larger modular networks can be trained with the same amount of memory resources (Shazeer et al., 2017). In future work, we can take advantage of the continuity of superposition and define objectives to learn the context functions which control the parameter memory.

Acknowledgements

We thank Andreea Bobu for motivating the rotating-MNIST task and Jesse Livezey for suggesting a method to sample from the Haar distribution.

References

Appendix A Analysis of retrieval noise

Assume $w$ and $x$ are fixed vectors and $c$ is a random context vector, each element of which has unit amplitude and a uniformly distributed phase.

A.1 Proposition 1: superposition bias analysis

We consider three cases: a real-valued network with binary context vectors, a complex-valued network with complex context vectors, and a real-valued network with orthogonal matrix contexts. For each case we show that, if the context vectors / matrices have a uniform distribution on their domain of definition, the expectation of the scalar product between the weights and the context-bound input vector is zero.

Real-valued network with binary context vectors.

Assuming a fixed weight vector $w \in \mathbb{R}^{N}$ for a given neuron, a fixed pre-context input $x \in \mathbb{R}^{N}$, and a random binary context vector $c \in \{-1, 1\}^{N}$ with i.i.d. components,

$$\mathbb{E}\big[w^{\top}(c \odot x)\big] \;=\; \sum_{j} w_j x_j\, \mathbb{E}[c_j] \;=\; 0$$

because $\mathbb{E}[c_j] = 0$.

Complex-valued network with complex context vectors.

Again, we assume a fixed weight vector $w \in \mathbb{C}^{N}$ for a given neuron, a fixed pre-context input $x \in \mathbb{C}^{N}$, and a random complex context vector $c$ with i.i.d. components, such that $|c_j| = 1$ for every $j$ and the phase of $c_j$ has a uniform distribution on the circle. Then

$$\mathbb{E}\big[\bar{w}^{\top}(c \odot x)\big] \;=\; \sum_{j} \bar{w}_j x_j\, \mathbb{E}[c_j] \;=\; 0,$$

where $\bar{w}$ is the conjugate of $w$. Here $\mathbb{E}[c_j] = 0$ because $c_j$ has a uniform distribution on the circle.

Real-valued network with real-valued rotational context matrices.

Let $x \in \mathbb{R}^{N}$ be a vector and the context be a random orthogonal matrix $C$ drawn from the Haar distribution (Mezzadri, 2006). Then for fixed $x$, $Cx$ defines a uniform distribution on the sphere $S^{N-1}$ with radius $\lVert x \rVert$. Due to the symmetry of this distribution,

$$\mathbb{E}\big[w^{\top} C x\big] \;=\; w^{\top}\, \mathbb{E}[Cx] \;=\; 0.$$

A.2 Proposition 2: Variance induced by context vectors

Similarly to Proposition 1, we give an estimate of the variance for each individual case: a real-valued network with binary context vectors, a complex-valued network with complex context vectors, and a real-valued network with rotational context matrices. All the assumptions are the same as in Proposition 1.

Real-valued network with binary context vectors.

$$\operatorname{Var}\big[w^{\top}(c \odot x)\big] \;=\; \mathbb{E}\Big[\big(\textstyle\sum_{j} w_j x_j c_j\big)^{2}\Big] \;=\; \sum_{j} w_j^{2} x_j^{2}\,\mathbb{E}\big[c_j^{2}\big] \;+\; \sum_{j \neq l} w_j x_j w_l x_l\,\mathbb{E}\big[c_j c_l\big] \;=\; \sum_{j} w_j^{2} x_j^{2}.$$

Here we made use of the facts that $c_j$ and $c_l$ (for $j \neq l$) are independent variables with zero mean, and that $c_j^{2} = 1$.

Note that

$$\sum_{j} w_j^{2} x_j^{2} \;\le\; \lVert w \rVert^{2}\, \lVert x \rVert^{2}.$$

If $\sum_j w_j^2 x_j^2 \ll (w^{\top} x)^2$, then the interference will be relatively small compared to the signal. Indeed, if we assume that each term has a comparably small contribution to the inner product (e.g. when using dropout), $w_j x_j \approx \tfrac{1}{N}\, w^{\top} x$, and then

$$\operatorname{Var}\big[w^{\top}(c \odot x)\big] \;\approx\; \tfrac{1}{N}\,\big(w^{\top} x\big)^{2}.$$

Complex-valued network with complex context vectors.

$$\operatorname{Var}\big[\bar{w}^{\top}(c \odot x)\big] \;=\; \mathbb{E}\Big[\big|\textstyle\sum_{j} \bar{w}_j x_j c_j\big|^{2}\Big] \;=\; \sum_{j} |w_j|^{2} |x_j|^{2}\,\mathbb{E}\big[|c_j|^{2}\big] \;+\; \sum_{j \neq l} \bar{w}_j x_j w_l \bar{x}_l\,\mathbb{E}\big[c_j \bar{c}_l\big] \;=\; \sum_{j} |w_j|^{2} |x_j|^{2}.$$

Here we used the fact that $c_j$ and $c_l$ (for $j \neq l$) are independent random variables and that $|c_j|^{2} = 1$. We also make use of the fact that $\mathbb{E}\big[c_j \bar{c}_l\big] = \mathbb{E}[c_j]\,\mathbb{E}[\bar{c}_l] = 0$ for $j \neq l$, which in turn follows from the fact that $c_j$ and $c_l$ are independent variables.

Similarly to the binary case, we can assume that each term has a comparably small contribution to the inner product (e.g. when using dropout), so that $|w_j x_j| \approx \tfrac{1}{N}\,\big|\bar{w}^{\top} x\big|$, where $N$ is the dimension of $x$. Then

$$\operatorname{Var}\big[\bar{w}^{\top}(c \odot x)\big] \;\approx\; \tfrac{1}{N}\,\big|\bar{w}^{\top} x\big|^{2}.$$

Real-valued network with real-valued rotational context matrices.

We show the case in a high-dimensional real vector space for a randomly rotated vector. Let again $x \in \mathbb{R}^{N}$ be a vector and $C$ be a random orthogonal matrix drawn from the Haar distribution. Then for fixed $x$, $Cx$ defines a uniform distribution on the sphere $S^{N-1}$ with radius $\lVert x \rVert$.

Consider a random vector $g \in \mathbb{R}^{N}$, whose components are drawn i.i.d. from $\mathcal{N}(0, 1)$. Then one can establish a correspondence between $Cx$ and $g$:

$$Cx \;\overset{d}{=}\; \lVert x \rVert\, \frac{g}{\lVert g \rVert},$$

since $g / \lVert g \rVert$ is uniformly distributed on the unit sphere. Without loss of generality, let $w = \lVert w \rVert\, e_1$, where $e_1$ is the first basis vector. Then we have:

$$\mathbb{E}\big[(w^{\top} C x)^{2}\big] \;=\; \lVert w \rVert^{2}\, \lVert x \rVert^{2}\; \mathbb{E}\!\left[\frac{g_1^{2}}{\lVert g \rVert^{2}}\right].$$

Due to the symmetry among the coordinates, $\mathbb{E}\big[g_j^{2} / \lVert g \rVert^{2}\big]$ is the same for every $j$, and the $N$ such terms sum to $\mathbb{E}\big[\lVert g \rVert^{2} / \lVert g \rVert^{2}\big] = 1$. So $\mathbb{E}\big[g_1^{2} / \lVert g \rVert^{2}\big] = \tfrac{1}{N}$ and thus

$$\operatorname{Var}\big[w^{\top} C x\big] \;=\; \mathbb{E}\big[(w^{\top} C x)^{2}\big] \;=\; \tfrac{1}{N}\, \lVert w \rVert^{2}\, \lVert x \rVert^{2}.$$

If we consider the case where $N$ is large, this variance is small relative to $\lVert w \rVert^{2} \lVert x \rVert^{2}$, so the interference introduced by a randomly rotated model remains small.

Appendix B Online learning with unitary transformations

Here we show that training a model which is in superposition with other models using gradient descent yields almost the same parameter update as training this model independently (without superposition). For example, imagine two networks with parameters $w_1$ and $w_2$ combined into one superposition network using context vectors $c_1$ and $c_2$, such that the parameters of the PSP network are $w = w_1 \odot c_1 + w_2 \odot c_2$. Then, what we show below is that training the PSP network with the context vector $c_1$ results in nearly the same change of parameters as training the network $w_1$ independently and then combining it with $w_2$ using the context vectors.

To prove this we consider two models. The original model is designed to solve task 1. The PSP model combines models for several tasks. Consider the original model as a function of its parameters $w_1$ and denote it as $f(x; w_1)$. Throughout this section we assume $w_1$ to be a vector.

Note that for every $w_1$, the function $f(\,\cdot\,; w_1)$ defines a mapping from inputs to outputs. The PSP model, when used for task 1, can also be defined as $f^{psp}(x; w)$, where $w$ is a superposition of all weights.

We define a superposition function $B$, combining the weights $w_1$ with any other set of parameters $w_2, \ldots, w_K$: $w = B(w_1; w_2, \ldots, w_K)$.

We also define a read-out function $r$ which extracts $w_1$ from $w$, possibly with some error $e$:

$$r(w) = w_1 + e.$$

The error $e$ must have such properties that the two models $f(x; w_1)$ and $f(x; w_1 + e)$ produce similar outputs on the data and have approximately equal gradients $\nabla_{w_1} f(x; w_1)$ and $\nabla_{w_1} f(x; w_1 + e)$ on the data.

When the PSP model is used for task 1, the following holds:

$$f^{psp}(x; w) \;=\; f\big(x; r(w)\big) \;=\; f\big(x; w_1 + e\big).$$

Our goal now is to find conditions on the superposition and read-out functions such that, for any input/output data, the gradient of $f^{psp}$ with respect to $w$ is equal (or nearly equal) to the gradient of $f$ with respect to $w_1$, transformed back to the $w$ space. Since on the data the functions $f(x; w_1)$ and $f(x; w_1 + e)$ are assumed to be nearly equal together with their gradients, we can omit the error term $e$.

The gradient update to the weights of the PSP model for task 1, with learning rate $\eta$, is

$$\Delta w \;=\; -\eta\, \nabla_{w} L^{psp} \;=\; -\eta \left(\frac{\partial r}{\partial w}\right)^{\!\top} \nabla_{w_1} L,$$

using $L^{psp}(w) = L\big(r(w)\big)$. Now the change $\Delta w_1$ corresponding to this $\Delta w$ can be computed using a linear approximation of $r$:

$$r(w + \Delta w) \;\approx\; r(w) + \frac{\partial r}{\partial w}\,\Delta w,$$

and hence

$$\Delta w_1 \;=\; \frac{\partial r}{\partial w}\,\Delta w \;=\; -\eta\, \frac{\partial r}{\partial w}\left(\frac{\partial r}{\partial w}\right)^{\!\top} \nabla_{w_1} L.$$

Thus in order for $\Delta w_1$, which is here computed using the gradient of $L^{psp}$ with respect to $w$, to be equal to the one computed directly using the gradient of $L$ with respect to $w_1$, it is necessary and sufficient that

$$\frac{\partial r}{\partial w}\left(\frac{\partial r}{\partial w}\right)^{\!\top} = I. \qquad (11)$$

B.1 Real-valued network with binary context vectors

Assume a weight vector $w_1 \in \mathbb{R}^{N}$ and a binary context vector $c_1 \in \{-1, 1\}^{N}$. We define the superposition function

$$w \;=\; B(w_1; w_2, \ldots, w_K) \;=\; \sum_{k} w_k \odot c_k.$$

Since $c_1 \odot c_1 = \mathbf{1}$ for binary vectors, the read-out function can be defined as:

$$r(w) \;=\; w \odot c_1 \;=\; w_1 + \underbrace{\sum_{k \neq 1} w_k \odot c_k \odot c_1}_{e}.$$

In propositions 1 and 2 we have previously shown that in the case of binary vectors the error $e$ has a small contribution to the inner product. What remains to show is that condition 11 is satisfied.

Note that

$$\frac{\partial r}{\partial w} \;=\; \operatorname{diag}(c_1).$$

Since $c_{1,j}^{2} = 1$ for every element $j$, the matrix $\operatorname{diag}(c_1)$ is orthogonal and hence condition 11 is satisfied.

B.2 Complex-valued network with complex context vectors

The proof for the complex context vectors is very similar to that for the binary case. Let the context vector $c_1 \in \mathbb{C}^{N}$, s.t. $|c_{1,j}| = 1$ for every $j$. It is convenient to use the notation of linear algebra over the complex field. One should note that nearly all linear algebraic expressions remain the same, except that the transposition operator should be replaced with the Hermitian conjugate, which is the combination of transposition and changing the sign of the imaginary part.

We define the superposition operation as

$$w \;=\; \sum_{k} w_k \odot c_k.$$

When $|c_{1,j}| = 1$, $\bar{c}_1 \odot c_1 = \mathbf{1}$, where $\bar{\cdot}$ is the element-wise conjugate operator (change of sign of the imaginary part). The read-out function can be defined as:

$$r(w) \;=\; w \odot \bar{c}_1 \;=\; w_1 + e.$$

The necessary condition 11 transforms into

$$\frac{\partial r}{\partial w}\left(\frac{\partial r}{\partial w}\right)^{\!H} = I,$$

where $I$ is a complex identity matrix, whose real parts form an identity matrix and all imaginary parts are zero. This condition is satisfied for the chosen complex context vector because

$$\operatorname{diag}(\bar{c}_1)\, \operatorname{diag}(\bar{c}_1)^{H} \;=\; \operatorname{diag}(\bar{c}_1 \odot c_1) \;=\; I.$$

B.3 Real-valued network with real-valued rotational context matrices

The proof is again very similar to the previous cases. The superposition operation is defined as

$$w \;=\; \sum_{k} C_k\, w_k,$$

where each $C_k$ is a rotational (orthogonal) matrix.

The read-out function is

$$r(w) \;=\; C_1^{\top} w \;=\; w_1 + e,$$

where $C_1^{\top}$ is the transpose of $C_1$. Here we use the fact that $C^{\top} C = I$ for rotation matrices.

The condition 11 becomes

$$\frac{\partial r}{\partial w}\left(\frac{\partial r}{\partial w}\right)^{\!\top} \;=\; C_1^{\top} C_1 \;=\; I,$$

and hence is satisfied. ∎