Improving Differentiable Neural Computers Through Memory Masking, De-allocation, and Link Distribution Sharpness Control

04/23/2019 ∙ by Róbert Csordás, et al. ∙ IDSIA

The Differentiable Neural Computer (DNC) can learn algorithmic and question answering tasks. An analysis of its internal activation patterns reveals three problems: Most importantly, the lack of key-value separation makes the address distribution resulting from content-based look-up noisy and flat, since the value influences the score calculation, although only the key should. Second, DNC's de-allocation of memory results in aliasing, which is a problem for content-based look-up. Third, chaining memory reads with the temporal linkage matrix exponentially degrades the quality of the address distribution. Our proposed fixes of these problems yield improved performance on arithmetic tasks, and also improve the mean error rate on the bAbI question answering dataset by 43%.


1 Introduction

Although Recurrent Neural Networks (RNNs) such as LSTM

(Hochreiter & Schmidhuber, 1997; Gers et al., 2000) are in theory capable of solving complex algorithmic tasks (Siegelmann & Sontag, 1992), in practice they often struggle to do so. One reason is the large amount of time-varying memory required for many algorithmic tasks, combined with quadratic growth of the number of trainable parameters of a fully connected RNN when increasing the size of its internal state. Researchers have tried to address this problem by incorporating an external memory as a useful architectural bias for algorithm learning (Das et al., 1992; Mozer & Das, 1993; Graves et al., 2014, 2016).

Especially the Differentiable Neural Computer (DNC; Graves et al. (2016)) has shown great promise on a variety of algorithmic tasks – see diverse experiments in previous work (Graves et al., 2016; Rae et al., 2016). It combines a large external memory with advanced addressing mechanisms such as content-based look-up and temporal linking of memory cells. Unlike approaches that achieve state of the art performance on specific tasks, e.g. MemNN (Sukhbaatar et al., 2015) or Key-Value Networks (Miller et al., 2016) for the bAbI dataset (Weston et al., 2015), the DNC consistently reaches near state of the art performance on all of them. This generality makes the DNC worthy of further study.

Three problems with the current DNC revolve around the content-based look-up mechanism, which is the main memory addressing system, and the temporal linking used to read memory cells in the same order in which they were written. First, the lack of key-value separation negatively impacts the accuracy of content retrieval. Second, the current de-allocation mechanism fails to remove de-allocated data from memory, which prevents the network from erasing outdated information without explicitly overwriting the data. Third, with each write, the noise from the write address distribution accumulates in the temporal linking matrix, degrading the overall quality of temporal links.

Here we propose a solution to each of these problems. We allow for dynamic key-value separation through a masking of both look-up key and data that is more general than a naive fixed key-value memory, yet does not suffer from loss of accuracy in addressing content. We propose to wipe the content of a memory cell in response to a decrease of its usage counter to allow for proper memory de-allocation. Finally, we reduce the effect of noise accumulation in the temporal linking matrix through exponentiation and re-normalization of the links, resulting in improved sharpness of the corresponding address distribution.

These improvements are orthogonal to other previously proposed DNC modifications. Incorporating the differentiable allocation mechanism of Ben-Ari & Bekker (2017) or certain improvements to memory usage and computational complexity (Rae et al., 2016) might further improve the results reported in this paper. The bAbI-specific modifications of Franke et al. (2018) are also orthogonal to our work.

We evaluate each of the proposed modifications empirically on a benchmark of algorithmic tasks and on bAbI (Weston et al., 2015). In all cases we find that our model outperforms the DNC. In particular, on bAbI we observe a 43% relative improvement in terms of mean error rate. We find that improved de-allocation together with sharpness enhancement leads to zero error and 3x faster convergence on the large repeated copy task, which DNC is not able to solve at all.

Section 2 provides a brief overview of the DNC. Section 3 discusses identified problems and proposed solutions in more detail. Section 4 analyzes these modifications one-by-one, demonstrating their positive effects.

2 Differentiable Neural Computer

Here we provide a brief overview of the Differentiable Neural Computer (DNC). More details can be found in the original work of Graves et al. (2016).

The DNC combines a neural network (called the controller) with an external memory that includes several supporting modules (subsystems). These read and write the memory, allocate new memory cells, chain memory reads in the order in which the cells were written, and search the memory for partial data. A simplified block diagram of the memory access is shown in Fig. 1.

External memory

A main component is the external, fixed-size 2D memory organized in cells, M ∈ R^{N×W}, where N is the number of cells and W is the cell length. The memory size is independent of the number of trainable parameters. The controller is responsible for producing the activations of the gates and keys controlling the memory transactions. The memory is accessed through multiple read heads and a single write head. Cells are addressed through a distribution over the whole address space. Each cell is read from and written to at every time step to the extent determined by the address distributions, resulting in a differentiable procedure.

Memory addressing

The DNC uses three addressing methods. The most important one is content-based look-up. It compares every cell to a key k produced by the controller, resulting in a score which is then normalized to obtain an address distribution over the whole memory. The second is temporal linking, which has two types: forward and backward. These indicate which cell was written after and which before the one read in the previous time step, and are useful for processing sequences of data. A so-called temporal linkage matrix L ∈ R^{N×N} is used to project any address distribution to the distribution that follows it (forward) or precedes it (backward). The third is the allocation mechanism, which is used only by the write head when a new memory cell is required.
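As a rough illustration of content-based look-up, the following sketch (not the authors' code; variable names and shapes are ours) scores every cell against a key with cosine similarity and turns the scores into an address distribution with a softmax whose temperature is the key strength:

import torch

def content_lookup(memory, key, beta, eps=1e-6):
    """memory: (N, W), key: (W,), beta: scalar key strength."""
    dot = memory @ key                                # (N,)
    norms = memory.norm(dim=1) * key.norm() + eps     # numerical stabilization
    scores = dot / norms                              # cosine similarity per cell
    return torch.softmax(beta * scores, dim=0)        # address distribution over N cells

memory = torch.randn(16, 8)
key = torch.randn(8)
address = content_lookup(memory, key, beta=5.0)
print(address.sum())                                  # ~1.0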

Memory allocation

Memory allocation works by maintaining usage counters for every cell. These are incremented on memory writes and optionally decremented on memory reads (de-allocation). When a new cell is allocated, the one with the lowest usage counter is chosen. De-allocation is controlled by a gate, which is based on the address distribution of the previous read and decreases the usage counter of each cell.
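The sketch below illustrates the usage-counter-based allocation of the standard DNC in simplified form (variable names are ours): cells are sorted by usage, and the least-used cells receive the largest allocation weight.

import torch

def allocation_weighting(usage):
    """usage: (N,) values in [0, 1]; returns an allocation address distribution."""
    sorted_usage, free_list = torch.sort(usage)            # ascending: least used first
    prod = torch.cumprod(sorted_usage, dim=0)
    prod = torch.cat([torch.ones(1), prod[:-1]])            # exclusive cumulative product
    alloc_sorted = (1 - sorted_usage) * prod
    alloc = torch.zeros_like(usage)
    alloc[free_list] = alloc_sorted                          # scatter back to original order
    return alloc

print(allocation_weighting(torch.tensor([0.9, 0.1, 0.5, 0.99])))   # peak on the least-used cell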

Read / Write

The memory is first written to, then read from. A write address is generated as a weighted average of the write content-based look-up distribution and the allocation distribution. The update is gated and uses an erase vector. In parallel with the write, the temporal linkage matrix is also updated. Finally, the memory is read. The read address distribution is generated as a weighted average of the read content-based look-up distribution and the forward and backward temporal link distributions. Memory cells are averaged according to this address, resulting in a single vector, which is the retrieved data. This data is combined with the output of the controller to produce the model's final output.

Figure 1: Simplified block diagram of DNC's memory access module with a single read head. Yellow boxes denote the inputs from the previous time step, orange boxes are the corresponding outputs to the next time step. Green boxes are the control inputs from the controller. Blue, rounded boxes are modules responsible for a specific function. w^w denotes the write address, w^r the read address, L the temporal linkage matrix, and M the memory. Arrow "r" denotes the output of the memory read.

3 Method

3.1 Masked Content-Based Addressing

The goal of content-based addressing is to find memory cells similar to a given key. The query key contains partial information (it is a partial memory), and the content-based memory read completes its missing (unknown) part based on previous memories. However, controlling which part of the key vector to search for is difficult because there is no key-value separation: the entire key and the entire cell value are compared to produce the similarity score. This means that the part of the cell value that is unknown at search time and should be retrieved is also used in the normalization of the cosine similarity, resulting in an unpredictable score. The shorter the known part and the longer the part to be retrieved, the worse the problem. This can result in less similar cells receiving higher scores, and can make the resulting address distribution flat, because the division by the length of the full cell content before the softmax acts as an increased temperature parameter. Imagine a situation in which the network has to tag some of the data written into memory (for example because it is the start of the sequence in the repeated copy task, see Section 4). For tagging it can use only a single element of the memory cell, using the rest for the data that has to be stored. Later, when the network has to find this tag, it searches for it, specifying only that single element of the key. But the resulting score is also normalized by the data stored along with the tag, which takes up most of the memory cell, so the resulting score is almost completely dominated by it.

The problem can be solved by providing a way to explicitly mask the part that is unknown and should not be used in the query. This is more general than a key-value memory, since the key-value separation can be controlled dynamically, and it does not suffer from the incorrect-score problem. We achieve this by producing a separate mask vector m through the controller and multiplying both the search key k and the memory content by it before comparing (β is the key strength controlling the temperature of the softmax):

C(M, k, β, m)[i] = softmax_i( D(k ⊙ m, M[i,·] ⊙ m) β )

where D denotes cosine similarity.
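A minimal sketch of this masked look-up, assuming the notation above (variable names and shapes are ours, not the authors' implementation):

import torch

def masked_content_lookup(memory, key, mask, beta, eps=1e-6):
    """memory: (N, W), key: (W,), mask: (W,) values in (0, 1], beta: key strength."""
    mk = key * mask                                       # masked key
    mM = memory * mask                                    # masked memory rows (broadcast over N)
    scores = (mM @ mk) / (mM.norm(dim=1) * mk.norm() + eps)
    return torch.softmax(beta * scores, dim=0)

memory = torch.randn(16, 8)
key = torch.zeros(8); key[0] = 1.0                        # query only the "tag" element
mask = torch.zeros(8); mask[0] = 1.0                      # extreme mask: ignore the stored payload
print(masked_content_lookup(memory, key, mask, beta=10.0))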

Fig. 2 shows how the masking step is incorporated in the address generation of the DNC.

3.2 De-allocation and Content-Based Look-up

The DNC tracks the allocation state of memory cells through so-called usage counters, which are increased on memory writes and optionally decreased after reads. When allocating memory, the cell with the lowest usage is chosen. Decreasing is done by element-wise multiplication with a so-called retention vector ψ_t ∈ [0,1]^N, which is a function of the previously read address distributions and scalar free gates, and indicates how much of the current memory should be kept. The problem is that ψ_t affects solely the usage counters and not the actual memory M. But the memory content plays a vital role in both read and write address generation: content-based look-up still finds de-allocated cells, resulting in memory aliasing. Consider the repeated copy task (Section 4), which needs repeated allocation while a sequence is stored and de-allocation after it has been read. The network has to store and repeat sequences multiple times. It has to tag the beginning of each sequence to know where the repetition should start; this could be done by content-based look-up. During the repetition phase, each cell read is also de-allocated. However, when the repetition of the second sequence starts, the search for the tagged cell can find both the old and the new marked cell with equal score, making it impossible to determine which one is the correct match. We propose to zero out the memory contents by multiplying every cell of the memory matrix with the corresponding element of the retention vector. The memory update equation then becomes:

M_t = M_{t−1} ⊙ (E − w_t^w e_t^⊤) ⊙ (ψ_t 1^⊤) + w_t^w v_t^⊤        (1)

where ⊙ is the element-wise product, 1 ∈ R^W is a vector of ones, and E ∈ R^{N×W} is a matrix of ones. Note that the cosine similarity (used for comparing the key to the memory content) is normalized by the length of the memory content vector, which would normally cancel the effect of Eq. 1. However, in practice, for numerical stability, the cosine similarity is implemented as

D(u, v) = (u · v) / (‖u‖ ‖v‖ + ε)

where ε is a small constant. In practice, free gates tend to be almost 1, so ψ_t is very close to 0 for de-allocated cells, making the stabilizing constant dominant with respect to the norm of the erased memory content vector. This assigns a low score to the erased cell in content-based addressing: the memory is effectively removed.
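A minimal sketch of the modified memory update of Eq. 1 (variable names are ours; an illustration, not the authors' implementation):

import torch

def write_memory(M, psi, w_write, erase, value):
    """M: (N, W) memory, psi: (N,) retention vector, w_write: (N,) write address,
    erase: (W,) erase vector, value: (W,) write vector."""
    keep = 1.0 - torch.outer(w_write, erase)          # standard gated erase term
    M = M * psi.unsqueeze(1) * keep                    # NEW: also wipe content by retention
    return M + torch.outer(w_write, value)

M = torch.randn(4, 6)
psi = torch.tensor([1.0, 0.0, 1.0, 1.0])               # cell 1 was freed by the read gates
M = write_memory(M, psi, torch.tensor([0., 0., 1., 0.]), torch.zeros(6), torch.randn(6))
print(M[1])                                            # ~all zeros: wiped content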

3.3 Sharpness of Temporal Link Distributions

Temporal linking lets the model sequentially read memory cells in the same or reverse order as they were written. For example, repeating a sequence is possible without content-based look-up: the forward links can be used to jump to the next cell. Any address distribution can be projected to the next or the previous one by multiplying it with the so-called temporal link matrix L ∈ [0,1]^{N×N} or its transpose. L can be understood as a continuous adjacency matrix. On every write, all elements of L are updated to an extent controlled by the write address distribution w^w. Links related to previous writes are weakened; the new links are strengthened. If w^w is not one-hot, sequence information about all non-zero addresses will be reduced in L, and noise from the current write will also be included repeatedly. This makes the forward (f) and backward (b) distributions of cells present in memory for a long time noisier and noisier, and flattens them out. When chaining multiple reads by temporal links, the new address is generated through repeated multiplication by L, making the blurring effect exponentially worse.
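A tiny numeric illustration of this blurring (a toy example of ours, not taken from the paper): with a slightly noisy link matrix, repeatedly following the forward links flattens the address distribution step by step.

import torch

N = 8
# soft "next cell" adjacency: 0.9 to the true successor, the remaining 0.1 spread elsewhere
L = torch.full((N, N), 0.1 / (N - 1))
for i in range(N - 1):
    L[i + 1, i] = 0.9
w = torch.zeros(N); w[0] = 1.0          # start by reading cell 0
for step in range(4):
    w = L @ w                            # follow the forward links
    w = w / w.sum()
    print(step + 1, w.max().item())      # peak probability decays towards uniform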

We propose to add the ability to improve the sharpness of the link distributions f and b. This does not fix the noise accumulation in the link matrix L, but it significantly reduces the exponential blurring when following the temporal links, making the noise in L less harmful. We add a sharpness enhancement step to the generation of the temporal link distributions: by exponentiating and re-normalizing the distribution, the network can adaptively control the importance of non-dominant elements of the distribution.

S(d, s)[i] = d[i]^s / Σ_j d[j]^s        (2)

The scalars s^f and s^b should be generated by the controller. The oneplus nonlinearity is used to bound them to the range [1, ∞): s^f = oneplus(ŝ^f) and s^b = oneplus(ŝ^b). Note that the exponentiation in Eq. 2 can be numerically unstable, so in practice we stabilize it.
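A minimal sketch of the sharpening step; the log-space stabilization shown here is one plausible choice and not necessarily the authors' exact formulation:

import torch

def sharpen(dist, s, eps=1e-8):
    """dist: (N,) address distribution, s: scalar >= 1 (oneplus output)."""
    logits = s * torch.log(dist + eps)       # exponentiation done in log space
    return torch.softmax(logits, dim=0)      # equals dist**s / sum(dist**s), up to eps

blurry = torch.tensor([0.05, 0.15, 0.60, 0.20])
print(sharpen(blurry, s=3.0))                # mass concentrates on the dominant cell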

Fig. 2 shows the block diagram of read address generation in DNC with key masking and sharpness enhancement. The sharpness enhancement block is inserted into the forward and backward link generation path right before combining them with the content-based look-up distribution.

Figure 2: Block diagram of read address generation in DNC with key masking and sharpness enhancement. Blue parts indicate new components absent in standard DNC. CMP is a cosine similarity-based comparator. Memory and key are compared after a novel masking step. Before combining temporal links and content-based address distribution, sharpness enhancement takes place.

4 Experiments

To analyze the effects of our modifications we used simple synthetic tasks designed to require most DNC parts while leaving the internal dynamics somewhat human interpretable. The tasks allow for focusing on the individual contributions of specific network components. We also conducted experiments on the much more complex bAbI dataset (Weston et al., 2015).

We tested several variants of our model. For clarity, we use the following notation: DNC is the original network of Graves et al. (2016), DNC-D has modified de-allocation, DNC-S has added sharpness enhancement, and DNC-M has added masking in content-based addressing. Multiple modifications (D, M, S) can be present.

Copy Task

A sequence of binary random vectors is presented to the network, which is then required to repeat it. The repeat phase starts with a special input token, after all inputs have been presented. To solve this task the network has to remember the sequence, which requires allocating and recalling from memory. However, it does not require memory de-allocation and reuse. To force the network to demonstrate its de-allocation capabilities, several instances of such data are generated and concatenated. Because the total length of the sequences exceeds the number of cells in memory, the network is forced to reuse its memory cells. An example is shown in Fig. 8a.

Associative Recall Task

In the associative recall task (Graves et al., 2014), blocks of random words are presented to the network sequentially, with special bits indicating the start of each block. After the input has been presented, a special bit indicates the start of the recall phase, in which a randomly chosen block is repeated. The network needs to output the next block in the sequence.

Key-Value Retrieval Task

The key-value retrieval task demonstrates some properties of memory masking. Random words are presented to the network, each divided into two parts of equal length. After all words have been presented, they are shuffled and their first halves are fed to the network, which must output the missing second half of every word. The words are then shuffled again, the second halves are presented, and the corresponding first halves are requested. The network must be able to query its memory using either part of the words to complete this task.

4.1 Implementation details

Our PyTorch implementation is available at https://github.com/xdever/dnc. We provide equations for our DNC-MDS model in Appendix A. Following Graves et al. (2016), we trained all networks using RMSProp (Tieleman & Hinton, 2012) with momentum. All parameters except the word embedding vectors and biases are subject to weight decay. For task-specific hyperparameters, see Appendix B.

4.2 The Effect of Modifications

Masking

Fig. 5a shows the performance of various models on the associative recall task. The two best-performing models use memory masking. From the standard deviation it can be seen that many seeds converge much faster with masking than without. Sharpening negatively impacts performance on this task (see Section 4.3 for further discussion). Note that a strictly positive lower bound ε on the mask is essential for good performance (see Equation 4 in Appendix A).

To demonstrate that the system learns to use masking dynamically instead of learning a static weighting, we trained DNC-M on the key-value retrieval task. Fig. 5b shows how the network changes the mask when the query switches from the first half to the second half of the words. The parts of the mask activated during the first query phase almost complement those activated during the second, just like the query keys do.

(a) Effect of masking on convergence speed
(b) A sample mask from DNC-M
Figure 5: (a) Mean training loss on the associative recall task. The shaded area shows ±1 standard deviation (12 seeds/model). Masking improves convergence speed. (b) An example read mask of DNC-M in the key-value retrieval task. Yellow values indicate parts of the key the network searches for; blue values indicate parts that need to be retrieved from memory. When the query switches from the first to the second half of the words, the mask changes. In the bottom third (in) the input is stored (look-up is not used). In the middle third (q1) the first half of each word is presented in random order and the second half is retrieved. In the last third (q2) the second half is presented in random order and the first half is retrieved.
De-allocation

Graves et al. (2016) successfully trained DNC on the repeat copy task with a small number of repeats and relatively short sequences. We found that increasing these makes DNC fail to solve the task (see Fig. 8a), while Fig. 8b shows that our model solves the task perfectly. Its outputs are clean and sharp. Furthermore, it converges much faster than DNC, reaching near-zero loss very quickly. We hypothesize that the reason is the modified de-allocation: the network can store the beginning of every sequence with a similar key without causing look-up conflicts, as the previously present key is guaranteed to be wiped from memory. DNC seems able to solve the short version of the problem by learning to use different keys for every repeat, which is not a general solution. This hypothesis is, however, difficult to prove, as neither the write vector nor the look-up key is easily human-interpretable.

(a) Input, ref output, net output (repeated copy)
(b) Train loss of repeated copy task
Figure 8: (a) Input (top), ground truth (middle), and network output (bottom) of DNC on the big repeat copy task. DNC fails to solve the task; the output is blurry, and the problem becomes especially apparent in the later repeats. (b) De-allocation and sharpness enhancement substantially improve convergence speed. The improvement from masking is marginal, probably because the task can be solved mainly through temporal links rather than content-based look-up.

Sharpness enhancement

To analyze the degradation of temporal links after successive link matrix updates, we examined the forward and backward link distributions (f and b) of the model with modified de-allocation (DNC-D). The forward distribution is shown in Fig. 13a. The problem described in Section 3.3 is clearly visible: the distribution is blurry, and the problem becomes worse with each iteration. Fig. 13c shows the read mode π of the same run. It is clear that only content-based addressing (middle column) is used. When the network starts repeating a block, the weight of the forward links (last column) increases a bit, but as the distribution becomes more blurred, a pure content-based look-up is preferred. Probably it is easier for the network to perform a content-based look-up with a learned counter as key than to learn to restore the corrupted data from blurry reads. Fig. 13b shows the forward distributions of the model with sharpness enhancement, as suggested in Section 3.3, for the same input. The distribution is much sharper, staying sharp until the very end of the repeat block. The read mode π for the same run can be seen in Fig. 13d. The network clearly prefers to use the links in this case.

(a) DNC-D (without sharpness enhancement)
(b) DNC-DS (with sharpness enhancement)
(c) DNC-D (without sharpness enhancement)
(d) DNC-DS (with sharpness enhancement)
Figure 13: (a), (b) Example forward link distributions. Each row is an address distribution across all memory cells. Blue cells are not read; yellow cells are read with a large weight. (a) DNC-D: without sharpness enhancement the distributions are blurred, rarely having peaks near 1.0. The problem becomes worse over time; three repeats are shown, and the blurring increases with each repeat. (b) Sharpness enhancement (DNC-DS) makes the distribution sharp during the read, peaking near 1.0. Note that (a) and (b) have identical input data. (c), (d) The read mode distributions for (a) and (b). Columns are the weighting of the backward links, the content-based look-up, and the forward links, respectively. (c) The forward links are barely used without sharpness enhancement. (d) With sharpness enhancement the forward links are used for every block.

4.3 bAbI Experiments

bAbI (Weston et al., 2015) is an algorithmically generated question answering dataset containing 20 different tasks. Data is organized in sequences of sentences called stories. The network receives the story word by word. When a question mark is encountered, the network must output a single word representing the answer. A task is considered solved if the error rate (the number of incorrectly predicted answer words divided by the total number of predictions) of the network decreases below 5%, as is usual for this dataset.

Manually analyzing bAbI tasks led us to believe that some of them are difficult to solve within a single time step. Consider the sample from QA16: “Lily is a swan. Bernhard is a lion. Greg is a swan. Bernhard is white. Brian is a lion. Lily is gray. Julius is a rhino. Julius is gray. Greg is gray. What color is Brian? A: white” The network should be able to “think for a while” about the answer: it needs to do multiple memory searches to chain the clues together. This cannot be done in parallel, as the result of one query is needed to produce the key for the next. One solution would be adaptive computation time (Schmidhuber (2012), Graves (2016)), but that would add an extra level of complexity to the network. Instead, we insert a constant number of blank steps before every answer of the network, a difference to what was done previously by Graves et al. (2016). We also use a word embedding layer instead of a one-hot input representation, as is typical for NLP tasks. The embedding layer is a learnable lookup table that transforms word indices into learnable vectors.
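A hypothetical sketch of this preprocessing step (token names and the number of blank steps are placeholders of ours, not the values used in the paper):

def insert_blank_steps(tokens, n_blank=3, question_mark="?", blank="<blank>"):
    """Insert blank padding steps after every question mark so the network
    has extra computation steps before producing the answer."""
    out = []
    for tok in tokens:
        out.append(tok)
        if tok == question_mark:
            out.extend([blank] * n_blank)   # "think for a while" before answering
    return out

story = ["lily", "is", "a", "swan", ".", "what", "color", "is", "brian", "?"]
print(insert_blank_steps(story))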

In Table 1 we present experimental results of multiple versions of the network after 0.5M iterations of training with batch size 2. The performance reported by Graves et al. (2016) is also shown (column Graves et al). Our best-performing model (DNC-MD) reduces the mean error rate by about 43% relative to DNC, while also having lower variance. This model does not use sharpness enhancement; adding it penalizes mean performance only slightly in absolute terms. We hypothesize this is due to the nature of the task, which rarely needs step-by-step traversal of words but requires many content-based look-ups. When following the path of an object, many words and even sentences between the clues might be irrelevant, so sequential linking in the order of writing is of little to no use. Compare Franke et al. (2018), where the authors completely removed the temporal linking for bAbI. However, we argue that for other kinds of tasks link distribution sharpening can be very important (see Fig. 8b, where sharpening helps and masking does not).

Mean test error curves are shown in Fig. 14 in Appendix C. Our models converge faster and have both lower error and lower variance than DNC. (Note that our goal was not to achieve state of the art performance on bAbI (Santoro et al., 2017; Henaff et al., 2016; Dehghani et al., 2018), but to exhibit and overcome certain shortcomings of DNC.)

Table 1: bAbI error rates [%] of the different models (DNC (ours), DNC-MDS, DNC-DS, DNC-MS, DNC-MD, and Graves et al.) on tasks 1–20 and their mean, after 0.5M iterations of training.

5 Conclusion

We identified three drawbacks of the traditional DNC model, and proposed fixes for them. Two of them are related to content-based addressing: (1) Lack of key-value separation yields uncertain and noisy address distributions resulting from content-based look-up. We mitigate this problem by a special masking method. (2) De-allocation results in memory aliasing. We fix this by erasing memory contents in parallel to decreasing usage counters. (3) We try to avoid the blurring of temporal linkage address distributions by sharpening the distributions.

We experimentally analyzed the effect of each novel modification on synthetic algorithmic tasks. Our models achieved convergence speed-ups on all of them. In particular, modified de-allocation and masking in content-based look-up helped in every experiment we performed. The presence of sharpness enhancement should be treated as a hyperparameter, as it benefits some but not all tasks. Unlike DNC, DNC-MDS solves the large repeated copy task. DNC-MD improves the mean error rate on bAbI by 43%. The modifications are easy to implement, add only a few trainable parameters, and hardly affect execution time.

In future work we will investigate in more detail when sharpness enhancement helps, when it is harmful, and why. We will also investigate the possibility of merging our improvements with related work (Ben-Ari & Bekker, 2017; Rae et al., 2016) to further improve the DNC.

Acknowledgments

The authors wish to thank Sjoerd van Steenkiste, Paulo Rauber and the anonymous reviewers for their constructive feedback. We are also grateful to NVIDIA Corporation for donating a DGX-1 as part of the Pioneers of AI Research Award and to IBM for donating a Minsky machine. This research was supported by a European Research Council Advanced Grant (no: 742870).

References

Appendix A Implementation details

Here we present the equations for our full model (DNC-MDS). The other models can easily be implemented in line with the details of Section 3. We also highlight differences from the DNC of Graves et al. (2016).

The memory at step t is represented by the matrix M_t ∈ R^{N×W}, where N is the number of cells and W is the word length. The network receives an input x_t and produces the output y_t. The controller receives the input vector x_t concatenated with all R (the number of read heads) read vectors from the previous step, and produces the output vector h_t. The controller can be an LSTM or a feedforward network, and may have a single layer or multiple layers. The controller's output is mapped to the interface vector ξ_t by a learned matrix. An immediate output vector ν_t is also generated from h_t. The interface vector ξ_t is split into many sub-vectors controlling various parts of the network:

ξ_t = [ k_t^{r,1..R}; β̂_t^{r,1..R}; k_t^w; β̂_t^w; ê_t; v_t; f̂_t^{1..R}; ĝ_t^a; ĝ_t^w; π̂_t^{1..R}; m̂_t^{r,1..R}; m̂_t^w; ŝ_t^{f,1..R}; ŝ_t^{b,1..R} ]        (3)

Notation: i ∈ {1, …, R} is the read head index; k_t^{r,i} ∈ R^W are the keys used for read content-based address generation; β_t^{r,i} = oneplus(β̂_t^{r,i}) are the read key strengths; k_t^w ∈ R^W is the key used for content-based address generation for writes; β_t^w = oneplus(β̂_t^w) is the write key strength; e_t ∈ [0,1]^W is the erase vector, which acts as an in-cell gate for memory writes; v_t ∈ R^W is the write vector, i.e. the actual data being written; f_t^i ∈ [0,1] are the free gates controlling whether to de-allocate the cells read in the previous step; g_t^a ∈ [0,1] is the allocation gate; g_t^w ∈ [0,1] is the write gate; π_t^i are the read modes (controlling whether to use the temporal links or the content-based look-up distribution as read address); s_t^{f,i} are the forward sharpness enhancement coefficients; s_t^{b,i} are the backward sharpness enhancement coefficients.

Special care must be taken with the range of the lookup masks m_t^{r,i} and m_t^w. It must be limited to [ε, 1], where ε is a small real number: a mask value close to 0 might harm gradient propagation by blocking the gradients of the masked parts of the key and memory vectors.

m_t^{r,i} = ε + (1 − ε) σ(m̂_t^{r,i}),   m_t^w = ε + (1 − ε) σ(m̂_t^w)        (4)

We suggest initializing the biases of m̂_t^{r,i} and m̂_t^w to 1 to avoid low initial gradient propagation.

Content-based look-up is used to generate an address distribution based on matching a key against memory content:

C(M, k, β, m)[i] = softmax_i( D(k ⊙ m, M[i,·] ⊙ m) β )        (5)

Compare this to the unmasked C(M, k, β) of Graves et al. (2016).

where D is the row-wise cosine similarity with numerical stabilization:

D(u, v) = (u · v) / (‖u‖ ‖v‖ + ε)        (6)

The memory is first written to, then read from. To write the memory, the allocation and content-based look-up distributions are needed. Allocation is calculated based on the usage vectors u_t ∈ [0,1]^N. These are updated with the help of the memory retention vector ψ_t ∈ [0,1]^N:

ψ_t = ∏_{i=1}^{R} ( 1 − f_t^i w_{t−1}^{r,i} )        (7)
u_t = ( u_{t−1} + w_{t−1}^w − u_{t−1} ⊙ w_{t−1}^w ) ⊙ ψ_t        (8)

Operation ⊙ is the element-wise multiplication. The free list φ_t is the list of memory location indices sorted in ascending order of their usage u_t, so φ_t[1] is the index of the least used location. The allocation address distribution is then

a_t[φ_t[j]] = ( 1 − u_t[φ_t[j]] ) ∏_{l=1}^{j−1} u_t[φ_t[l]]

The write address distribution is:

w_t^w = g_t^w ( g_t^a a_t + (1 − g_t^a) c_t^w ),   where c_t^w = C(M_{t−1}, k_t^w, β_t^w, m_t^w)        (9)

Memory is updated by (1 ∈ R^W is a vector of ones, E ∈ R^{N×W} is a matrix of ones):

M_t = M_{t−1} ⊙ (E − w_t^w e_t^⊤) ⊙ (ψ_t 1^⊤) + w_t^w v_t^⊤        (10)

Compare this to the M_t = M_{t−1} ⊙ (E − w_t^w e_t^⊤) + w_t^w v_t^⊤ of Graves et al. (2016).

To track the temporal order of memory writes, a temporal link matrix L_t ∈ [0,1]^{N×N} is maintained; it can be understood as a continuous adjacency matrix. A helper quantity called precedence weighting is defined by p_0 = 0 and

p_t = ( 1 − Σ_i w_t^w[i] ) p_{t−1} + w_t^w        (11)

The link matrix is then updated (with L_0 = 0 and L_t[i,i] = 0) as

L_t[i,j] = ( 1 − w_t^w[i] − w_t^w[j] ) L_{t−1}[i,j] + w_t^w[i] p_{t−1}[j]
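A minimal sketch of this bookkeeping, using standard DNC update rules and our own variable names: the precedence weighting p tracks where we just wrote, and L[i, j] says how strongly cell i was written right after cell j.

import torch

def update_links(L, p, w_write):
    """L: (N, N) link matrix, p: (N,) precedence, w_write: (N,) write address."""
    p_new = (1 - w_write.sum()) * p + w_write
    wi = w_write.unsqueeze(1)                     # (N, 1)
    wj = w_write.unsqueeze(0)                     # (1, N)
    L_new = (1 - wi - wj) * L + wi * p.unsqueeze(0)
    L_new.fill_diagonal_(0)                       # no self-links
    return L_new, p_new

N = 4
L, p = torch.zeros(N, N), torch.zeros(N)
for cell in [0, 1, 2]:                            # write cells 0, 1, 2 in order
    w = torch.zeros(N); w[cell] = 1.0
    L, p = update_links(L, p, w)
print(L[2])                                       # strong link: cell 2 was written after cell 1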

The forward and backward address distributions f_t^i and b_t^i are given by:

f_t^i = S( L_t w_{t−1}^{r,i}, s_t^{f,i} ),   b_t^i = S( L_t^⊤ w_{t−1}^{r,i}, s_t^{b,i} )        (12)

where S is the sharpness enhancement function of Eq. 2.

Compare this to the f_t^i = L_t w_{t−1}^{r,i} and b_t^i = L_t^⊤ w_{t−1}^{r,i} of Graves et al. (2016).

The read address distribution is given by:

c_t^{r,i} = C( M_t, k_t^{r,i}, β_t^{r,i}, m_t^{r,i} )        (13)
w_t^{r,i} = π_t^i[1] b_t^i + π_t^i[2] c_t^{r,i} + π_t^i[3] f_t^i        (14)

Finally, the memory is read and the output is calculated:

r_t^i = M_t^⊤ w_t^{r,i},   y_t = ν_t + W_r [ r_t^1; …; r_t^R ]
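A minimal sketch of this final read step under the notation above (variable names are ours):

import torch

def read_head(M, backward, content, forward, read_mode):
    """M: (N, W); backward/content/forward: (N,) address distributions;
    read_mode: (3,) softmax weights over [backward, content, forward]."""
    w_read = read_mode[0] * backward + read_mode[1] * content + read_mode[2] * forward
    r = M.t() @ w_read                             # (W,) retrieved data
    return w_read, r

M = torch.randn(8, 16)
dists = [torch.softmax(torch.randn(8), dim=0) for _ in range(3)]
mode = torch.softmax(torch.randn(3), dim=0)
w_read, r = read_head(M, *dists, mode)
print(w_read.sum(), r.shape)                       # ~1.0, torch.Size([16])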

Appendix B Hyperparameters for the experiments

Copy Task.

We use an LSTM controller with hidden size 32, a memory of 16 words of length 16, and 1 read head. The data vectors have length 8, with a 9th bit indicating the start of the repeat phase. The sequence length and the number of repeats are chosen randomly. Batch size is 16.

Associative Recall Task.

We use a single-layer LSTM controller (size 128), a memory of 64 cells of length 32, and 1 read head, with a batch size of 16.

Key-Value Retrieval Task.

We use a single-layer LSTM controller of size 32, 16 memory cells of length 32, and 1 read head.

bAbI.

Our network has a single-layer LSTM controller (hidden size 256), 4 read heads, a word length of 64, and 256 memory cells. The batch size is 2.

Appendix C Additional bAbI results

Figure 14: Mean test error of the various models during training. The shaded area shows ±1 standard deviation.