Distributed Memory based Self-Supervised Differentiable Neural Computer

07/21/2020
by   Taewon Park, et al.
Kyungpook National University

A differentiable neural computer (DNC) is a memory augmented neural network devised to solve a wide range of algorithmic and question answering tasks, and it has shown promising performance in a variety of domains. However, its single-memory-based operations are not enough to store and retrieve the diverse informative representations that exist in many tasks. Furthermore, the DNC does not explicitly consider memorization itself as a target objective, which inevitably leads to a very slow learning speed. To address these issues, we propose a novel distributed memory-based self-supervised DNC architecture for enhanced memory augmented neural network performance. We introduce (i) a multiple distributed memory block mechanism that stores information independently in each memory block and uses the stored information cooperatively for diverse representation, and (ii) a self-supervised memory loss term which measures how well a given input is written to the memory. Our experiments on algorithmic and question answering tasks show that the proposed model outperforms all other variations of DNC by a large margin and also matches the performance of other state-of-the-art memory-based network models.


1 Introduction

Memory augmented neural networks (MANNs) have proven to be an essential component for many tasks that require long-term context understanding [graves2014neural, graves2016hybrid, gulcehre2018dynamic, sukhbaatar2015end, weston2014memory]. Compared to Recurrent Neural Networks (RNNs), such as Long Short-Term Memory (LSTM) [hochreiter1997long], a MANN can store more information from sequential input data and correctly recall desired information from an external memory given a cue. In particular, the Differentiable Neural Computer (DNC) [graves2016hybrid] is a well-known general-purpose MANN inspired by the von Neumann architecture. It adopts content-based addressing, which obtains memory addresses for storing contents from the contents themselves, and is designed for end-to-end training. The DNC also has a built-in reallocation function for efficient memory management. With such functionalities, the DNC shows promising performance in various application domains [graves2016hybrid, moro2018cross, mufti2019iterative, putin2018reinforced]. Moreover, there have been attempts to improve the DNC's performance on algorithmic and question answering (QA) tasks by applying well-known neural network generalization techniques or by enhancing the pre-existing functions of the conventional DNC [csordas2019improving, franke2018robust, rae2016scaling]. However, even with such improvements, the DNC still shows weaknesses on various QA tasks. Furthermore, it has no explicit target objective for memorization performance, which makes training very slow.

We hypothesize that these weaknesses of the DNC come from its single-memory-based representation and from the lack of explicit control over its memorization performance. For tasks that require storing input contents for diverse purposes or retrieving multiple associations existing in the input data, a single-memory-based representation has limited expressive power for storing complex relations in the data [csordas2019improving, franke2018robust, graves2014neural, graves2016hybrid, rae2016scaling]. Regarding memorization performance, the DNC does not provide any explicit functionality to ensure how much correct information its external memory retains. Therefore, if a given task involves complex reasoning over a long story input for QA, the model inevitably takes a long time to learn proper usage of its memory.

In this paper, to address those problems, we introduce a novel Distributed Memory based Self-Supervised Differentiable Neural Computer (DMSDNC) architecture. Inspired by how information is encoded in the human brain [doi:10.1076/jhin.10.3.308.9086, lashley1950search], we adopt a multiple-memory-based distributed representation for better generalization in our new external memory architecture. In this architecture, multiple distributed memory blocks are updated independently to store a given input, and the stored information is retrieved cooperatively through soft-attention-based interpolation over the distributed memory blocks. This read/write process is conceptually close to Distributed Memory (DM) [doi:10.1076/jhin.10.3.308.9086, flynn1989sparse, kanerva1988sparse, lashley1950search, wu2018kanerva], which can provide robust and diverse information representations. Similarly, it enables the DNC to learn to store diverse representations of the same input content for tasks with different purposes. As shown in other MANN models [Banino2020MEMO, munkhdalai2019metalearned, weston2014memory], storing given information in multiple forms according to its relation to a target is one of the key factors for solving complex reasoning problems.

Moreover, to enhance the memorization performance of the DNC, we present a novel loss function which expedites the learning process of the DNC in a self-supervised way. Our new loss function, named the Self-Supervised Memory Loss (SML), predicts the stored memory contents during training to ensure the correct memorization of input contents. It also keeps the balance between the target task objective and the SML through a dynamic re-weighting method [cui2019class, liu2006influence] based on stochastic sampling.

We combine our proposed distributed memory architecture with the SML function for efficient end-to-end training of the DNC. Our proposed DMSDNC model learns to remember diverse relational information in multiple distributed forms and trains much faster. In experiments, we demonstrate that the proposed model outperforms not only the original DNC but also all of its variations on a diverse set of learning problems, including algorithmic tasks, QA tasks on the bAbI dataset, and a sequential MNIST task.

2 Background

2.1 Differentiable Neural Computer

DNC [graves2016hybrid] is a memory augmented neural network inspired by conventional computer architecture, and it mainly consists of two parts: a controller and an external memory. When input data are provided to the controller, usually an LSTM, it generates memory operators, such as keys and values, from its internal state, h_t. Based on those memory operators, every read/write operation of the DNC is conducted.

In the writing process, DNC computes a writing address, w_t^w \in [0,1]^N, where N is the number of memory addresses, from write memory operators (e.g., the write-in key) and its built-in functions. Then it updates the external memory, M_t \in R^{N \times W}, with a write-in value, v_t \in R^W, and an erasing value, e_t \in [0,1]^W, where W is the memory length, as follows:

M_t = M_{t-1} \circ (E - w_t^w e_t^\top) + w_t^w v_t^\top    (1)

where \circ denotes element-wise multiplication and E is an N \times W matrix of ones.

In the reading process, DNC computes a reading address, w_t^{r,i} \in [0,1]^N, for each of its R read heads from read memory operators (e.g., the read-out key). Next, it reads out information from the external memory:

r_t^i = M_t^\top w_t^{r,i}    (2)

Finally, the output, y_t, is computed from the controller hidden state h_t and the read-out vectors r_t^1, ..., r_t^R through a trainable output layer. Through these operations, DNC can learn how to store input data and how to utilize the stored information to solve a given task. This overall mechanism makes DNC suitable as a general-purpose memory augmented neural network.
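As a concrete illustration of Eqs. (1) and (2), the following minimal PyTorch sketch implements the content-based write and read of a single DNC memory; the allocation weighting and temporal-linkage mechanisms of the full DNC are intentionally omitted, and the sharpening strength beta is an illustrative parameter.

    import torch
    import torch.nn.functional as F

    def content_address(memory, key, beta):
        # memory: (N, W), key: (W,), beta: scalar sharpening strength
        sim = F.cosine_similarity(memory, key.unsqueeze(0), dim=-1)  # (N,)
        return torch.softmax(beta * sim, dim=0)                      # weights over the N slots

    def dnc_write(memory, w_write, erase_vec, write_vec):
        # Eq. (1): M_t = M_{t-1} o (E - w e^T) + w v^T
        E = torch.ones_like(memory)
        return memory * (E - torch.outer(w_write, erase_vec)) + torch.outer(w_write, write_vec)

    def dnc_read(memory, w_read):
        # Eq. (2): r_t = M_t^T w_t^r
        return memory.t() @ w_read

In the full DNC, the content-based weighting above is further combined with allocation weighting for writing and with temporal linkage for reading.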

Figure 1: (a) The distributed memory with sub-memory blocks and attentive interpolation. (b) Self-supervised memory loss.

2.2 Related Work on Differentiable Neural Computers

There have been many attempts to address the weaknesses of the conventional DNC while maintaining its advantages. rae2016scaling pointed out that the DNC's computational cost grows with its memory size because of its memory-accessing method. To reduce this cost, they proposed a sparse memory accessing model (SDNC) [rae2016scaling], which considers only a few sparse positions in the memory for information storage. Other work [csordas2019improving] addressed the weaknesses of the conventional DNC's built-in functions: csordas2019improving pointed out the key/value separation problem of content-based addressing and applied a mask to the DNC memory operations as a solution. They also added a de-allocation method to the memory updating function and made temporal order information more sharply addressable with hard attention. Recently, franke2018robust tried to improve the performance of DNC on QA tasks by removing the temporal linkage and applying well-known neural network generalization techniques, such as dropout [srivastava2014dropout] and layer normalization [lei2016layer], to DNC.

Although those approaches enhanced DNC performance on algorithmic and QA tasks compared to the conventional DNC, they still have limited ability on complex reasoning tasks, such as the induction task in the bAbI dataset. To address this limitation, some researchers suggested new memory architectures, such as a two-memory system that stores different contents of asynchronous inputs in each memory [le2018dual], or a memory architecture for graph-structured input [pham2018relational]. However, such works are only applicable to specific application domains or input types and can hardly be regarded as a general solution to the problem. To find a more general strategy, we investigate several state-of-the-art MANN models [Banino2020MEMO, munkhdalai2019metalearned] and find that their major design factor for learning complex reasoning tasks is a diverse representation of the input information. Based on this finding, we adopt a distributed memory model of the human brain in our architecture for representational diversity. Furthermore, we propose a new Self-Supervised Memory Loss function which explicitly targets the correct memorization of input data in the DNC. Through SML, DNC learns the most proper representation of the input data, which is more robust and generalizable across diverse tasks.

3 Proposed Method

In this section, we propose a novel multiple distributed memory-based DNC architecture and an SML function, which together improve the conventional DNC's performance on complex reasoning tasks.

3.1 Distributed Memory Architecture

The multiple distributed memory architecture consists of a controller network and K memory blocks, where each memory block is a content-addressable memory similar to that of the original DNC [graves2016hybrid]. Figure 1(a) shows the overall read/write process of the proposed DMSDNC. For the writing operation, the controller of DMSDNC produces multiple writing weight vectors, one for each memory block. Each writing weight vector is used for the content-based addressing of one of the memory blocks and is independent of the other memory blocks. Since it is produced from the current input and the previous hidden state of the controller, each block can independently store its own representation of the same input content. This writing process enables DMSDNC to flexibly store diverse representations of the same input data in the multiple memory blocks. For the reading process, all memory blocks are read at the same time, and the read values are interpolated with soft attention to produce a single read-out vector. Through this attention-based reading process, DMSDNC retrieves the most suitable information for the current task from the distributed representations existing in the multiple memory blocks. Based on these read/write operations, DMSDNC learns how to store and retrieve diverse representations of the input data for tasks with different purposes. The following sections detail the main operations.

3.1.1 Controller for Multiple Memory Blocks

At each time step, the controller receives an external input, x_t, the read-out of the previous time step, r_{t-1}, and the previous hidden state of the controller, h_{t-1}, to update its current hidden state, h_t. After layer normalization, it produces an interface vector, \xi_t, which includes the read and write parameters for multiple memory access.

3.1.2 Write into Multiple Memory Blocks

The multiple memory writing process in our architecture is based on the content-based memory accessing mechanism of DNC. A single memory block is addressed and updated with the same procedure as in DNC, and this single-block update is applied to all memory blocks independently at the same time. As shown in Eq. (3), memory-block-specific weights, W_\xi^{(i)}, where i \in {1, ..., K}, are multiplied with the controller hidden state vector, h_t, and used for the independent memory operations of each memory block:

\xi_t^{(i)} = W_\xi^{(i)} h_t,    g_t = W_g h_t    (3)

where \xi_t^{(i)} is the memory operator for memory block i and g_t \in R^K is the attentive gate.
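A minimal sketch of Eq. (3), under the assumption that each memory block has its own linear projection of the controller state and that the attentive gate values come from an additional projection; the module and parameter names are illustrative, not the authors' exact implementation.

    import torch
    import torch.nn as nn

    class MultiBlockInterface(nn.Module):
        def __init__(self, hidden_size, interface_size, num_blocks):
            super().__init__()
            # one interface projection per memory block, Eq. (3)
            self.block_proj = nn.ModuleList(
                [nn.Linear(hidden_size, interface_size) for _ in range(num_blocks)])
            # unnormalized attentive gate values g_t, later normalized in Eq. (7)
            self.gate_proj = nn.Linear(hidden_size, num_blocks)

        def forward(self, h_t):
            xi = [proj(h_t) for proj in self.block_proj]  # per-block memory operators xi_t^(i)
            g_t = self.gate_proj(h_t)
            return xi, g_t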

Based on the writing operator obtained from \xi_t^{(i)}, the controller decides a writing address, w_t^{w,(i)}, for each memory block, M_t^{(i)}, and writes the input content into all blocks concurrently, as follows:

M_t^{(i)} = M_{t-1}^{(i)} \circ (E - w_t^{w,(i)} e_t^{(i)\top}) + w_t^{w,(i)} v_t^{(i)\top}    (4)

where e_t^{(i)} is the erasing value and v_t^{(i)} is the write-in value for memory block i.

This independence in the memory-block writing procedure provides representational diversity to our DMSDNC architecture. The following attention-based reading process then contributes to the distributed representation of the input data, which has superior generalization ability.
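As a sketch, the per-block write of Eq. (4) applies the DNC write rule of Eq. (1) independently to every block with its own address, erase, and write vectors; a batch dimension is omitted for clarity.

    import torch

    def write_blocks(memories, w_writes, erase_vecs, write_vecs):
        # memories: list of K tensors (N, W); the remaining arguments are per-block lists of
        # write weights (N,), erase vectors (W,), and write-in vectors (W,)
        updated = []
        for M, w, e, v in zip(memories, w_writes, erase_vecs, write_vecs):
            E = torch.ones_like(M)
            updated.append(M * (E - torch.outer(w, e)) + torch.outer(w, v))  # Eq. (4)
        return updated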

3.1.3 Read from Multiple Memory Blocks

As in the writing process, the controller obtains a reading operator from \xi_t^{(i)} and computes a read address, w_t^{r,j,(i)}, for each memory block i and each read head j. Based on those addresses, a preliminary read-out value, r_t^{j,(i)}, is derived from each memory block, M_t^{(i)}, as follows:

r_t^{j,(i)} = M_t^{(i)\top} w_t^{r,j,(i)}    (5)

These preliminary read-out values are interpolated by soft attention, as shown in Eq. (6), to produce the read-out value, r_t^j, which is then provided to the controller:

r_t^j = \sum_{i=1}^{K} a_t^{(i)} r_t^{j,(i)}    (6)

where the attentive gate, a_t^{(i)}, controls how much attention is paid to each memory block. The attentive gates are obtained from g_t with the softmax function:

a_t^{(i)} = \exp(g_t^{(i)}) / \sum_{k=1}^{K} \exp(g_t^{(k)})    (7)

This whole reading process enables our DMSDNC to learn to use the multiple memory blocks as a distributed representation according to the target task.
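The cooperative read of Eqs. (5)-(7) can be sketched as follows: each block is read with its own address, and the attentive gate, a softmax over g_t, interpolates the per-block read-outs into a single read-out vector (shown for one read head).

    import torch

    def read_blocks(memories, w_reads, g_t):
        # memories: list of K tensors (N, W); w_reads: list of K read weights (N,)
        # g_t: (K,) unnormalized attentive gate values from the controller
        prelim = torch.stack([M.t() @ w for M, w in zip(memories, w_reads)])  # (K, W), Eq. (5)
        a_t = torch.softmax(g_t, dim=0)                                       # (K,),  Eq. (7)
        return (a_t.unsqueeze(1) * prelim).sum(dim=0)                         # (W,),  Eq. (6)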

3.2 Self-Supervised Memory Loss

Conventional DNC and its variations are trained with a task-specific target objective function, as shown in Eq. (8):

L_task = \sum_{t=1}^{T} A(t) C(y_t, z_t)    (8)

where T is the whole sequence length, A(t) indicates whether time step t is in the answer phase (A(t) = 1 in the answer phase, 0 otherwise), z_t is the target answer, and C is the cross-entropy loss function.

Those models are always trained to predict correct answers for the current task without any consideration of their memorization performance. Therefore, to further enhance the memorization performance of memory-based models, it is necessary to build a new loss function for memorization. Here, we propose a new SML function, L_SML, which leads the model to update its memory contents with sampled input data in a self-supervised way. Our SML function uses the sampled input sequence as its target data, as shown in Eq. (9), and drives the model to memorize the given input information while it is learning the given task:

L_SML = \sum_{t=1}^{T} S(t) C(y_t, x_t)    (9)

where C is the cross-entropy loss function, and x_t and y_t are the input and the output at time step t, respectively.

As shown in Figure 1(b), SML learns to memorize sampled input sequences based on binomial sampling. For QA tasks, each step of a story input sequence is sampled with trial probability p, which we call the refreshing probability, as follows:

S(t) \sim B(1, p)    (10)

where S(t) is an indicator function that represents the sampling status at time t.
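A minimal sketch of Eqs. (9) and (10): at every story step, a Bernoulli draw with refreshing probability p decides whether that input becomes a self-supervised reconstruction target. The reconstruction head producing recon_logits and the tensor shapes are illustrative assumptions.

    import torch
    import torch.nn.functional as F

    def sml_loss(recon_logits, input_ids, story_mask, p):
        # recon_logits: (T, V) predictions of the input tokens from the model output
        # input_ids:    (T,)   input sequence reused as the self-supervised target, Eq. (9)
        # story_mask:   (T,)   1.0 at story-phase steps, 0.0 elsewhere
        # p:            refreshing probability of the Bernoulli sampling, Eq. (10)
        sample = torch.bernoulli(torch.full_like(story_mask, p)) * story_mask  # S(t)
        per_step = F.cross_entropy(recon_logits, input_ids, reduction='none')  # (T,)
        return (sample * per_step).sum(), sample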

When adding SML to the task-specific target objective for model training, we also need a strategy that keeps the balance between SML and the original target task loss, since the SML can overwhelm the total loss of the model as the number of story inputs increases. To prevent this loss imbalance problem, we apply a re-weighting method [cui2019class, liu2006influence] which dynamically keeps the balance between the target task objective L_task and the memory loss L_SML. Moreover, we introduce a scaling factor, \lambda, to ensure that the main portion of the training loss remains the original target objective:

\hat{L}_task = L_task / \sum_{t=1}^{T} A(t)    (11)
\hat{L}_SML = \lambda L_SML / \sum_{t=1}^{T} S(t) I_story(t)    (12)

where I_story(t) is an indicator function which represents whether the current time step is in the story phase or not.

Finally, the total loss for training the proposed model is:

L_total = \hat{L}_task + \hat{L}_SML    (13)
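A minimal sketch of the re-weighted total objective of Eqs. (11)-(13); the scalar summed losses, the mask tensors, and the default value of the scaling factor lambda are illustrative assumptions.

    import torch

    def total_loss(task_loss_sum, answer_mask, sml_loss_sum, sample_mask, lam=0.1):
        # task_loss_sum: summed cross-entropy over answer steps, Eq. (8)
        # sml_loss_sum:  summed self-supervised memory loss over sampled story steps, Eq. (9)
        # answer_mask:   (T,) 1.0 at answer steps A(t)
        # sample_mask:   (T,) 1.0 at sampled story steps S(t) I_story(t)
        n_answer = answer_mask.sum().clamp(min=1.0)
        n_sampled = sample_mask.sum().clamp(min=1.0)
        task_term = task_loss_sum / n_answer       # Eq. (11)
        sml_term = lam * sml_loss_sum / n_sampled  # Eq. (12)
        return task_term + sml_term                # Eq. (13)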

4 Experimental Analysis

We evaluate each of our main contributions, the distributed memory architecture and SML, separately as an ablation study, and then show the performance of the full DMSDNC architecture on complex QA tasks. Furthermore, we apply DMDNC to tasks in other domains, such as sequential MNIST, to show its general performance as a memory network. Additional experiments are described in the supplemental materials. In the ablation study, four tasks are adopted according to the purpose of each experiment: a distributed representation task, the bAbI task [weston2015towards], and the copy and associative recall tasks [graves2014neural]. In the DM experiments, we show that our distributed memory block mechanism can recall correct information from the distributed representation stored in multiple memory blocks given cues, and in the SML evaluations, we show that SML enhances not only the learning speed of DNC but also its memorization performance. Eventually, we show that our proposed model outperforms all other DNC variants and matches other state-of-the-art MANN models. As an application in another domain, we also evaluate DMSDNC on a pixel-sequence-based image recognition task. In all experiments, we adopt the well-known neural network generalization techniques used in franke2018robust for our baseline models. The detailed parameter settings and adopted generalization techniques are given in the supplemental materials.

4.1 Distributed Memory Architecture Evaluation

Figure 2: Mean training curves on the distributed representation task for different numbers of segments (left, center, right). The shadowed area represents the standard deviation over 10 trials.

The distributed memory architecture is evaluated with two different configurations. First, for the representational diversity experiment, the DNC memory is divided into sub-memory blocks so that the model has a similar total memory size and the same number of trainable parameters as the baseline. To prevent the information loss caused by too small a sub-memory block size, we limit the number of sub-memory blocks to 3. Second, to show the scalability of DM as the number of sub-memory blocks increases, we fix the sub-memory block size and evaluate model performance while continuously adding sub-memory blocks. In this setting, the total memory size increases linearly with the number of sub-memory blocks.

Representational Diversity Experiment

We create a novel Distributed Representation Task for distributed memory evaluation. In this task, each input is divided into several segments, and, among them, some segments are randomly picked to construct a cue vector. Given a sequence of such inputs, the network has to predict the rest of the input segments when the cue vector is provided as a clue at each time step; job complexity therefore increases with the number of segments per input. This task is intended to show how much detail of diverse input data and of the relations among segments can be memorized correctly. For a fair comparison, we configure the Distributed Memory based DNC (DMDNC) by dividing the original memory into two or three equally sized sub-memory blocks. The mean training curves of DMDNC-1, 2, and 3 are compared with the original DNC while increasing the task complexity, as shown in Figure 2. The results demonstrate that our proposed architecture learns the task much faster than the other DNC-based models and also shows better accuracy and learning stability (smaller standard deviation in the learning curves).
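To make the task concrete, a hypothetical generator for a single example could look as follows; the segment count, segment width, number of cue segments, and the zero-masking convention are illustrative assumptions, not the authors' exact protocol.

    import numpy as np

    def make_dr_example(num_segments=6, seg_width=4, num_cues=2, rng=np.random):
        # a random binary input split into num_segments segments of seg_width bits
        segments = rng.randint(0, 2, size=(num_segments, seg_width)).astype(np.float32)
        cue_idx = rng.choice(num_segments, size=num_cues, replace=False)
        cue = np.zeros_like(segments)
        cue[cue_idx] = segments[cue_idx]   # the cue keeps only the chosen segments
        target = segments.copy()
        target[cue_idx] = 0.0              # the network must predict the remaining segments
        return segments, cue, target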

Figure 3: Mean error rate of DMDNC models.

Therefore, as the number of segments increases, DMDNC-2 and DMDNC-3 show higher accuracy than the other baseline models, which clearly demonstrates the improved representational diversity of our distributed memory architecture. It also shows how efficiently the relations among many segments are stored in the memory and correctly recovered from incomplete cues.

Scalability Experiment on Distributed Memory

To evaluate the scalability of the distributed memory architecture without the effect of information loss in a sub-memory block, we adopt a fixed-size sub-memory block whose length is larger than half of the input size and then increase the number of sub-memory blocks to produce the models DMDNC-2, 3, and 4. We evaluate all models on a complex reasoning benchmark, the bAbI task, to show the effect of the number of memory blocks on representational diversity. The bAbI task [weston2015towards] is a set of 20 different tasks for evaluating text understanding and reasoning, such as basic induction and deduction. In Figure 3, DMDNC-1 represents a baseline model that has no memory division but includes the modifications of franke2018robust for better generalization. The overall graph shows that as the degree of distribution increases, the performance on the bAbI tasks also improves. Using more sub-memory blocks to further increase the degree of distribution yields gradual performance gains, which clearly shows the benefit of the distributed memory architecture. However, it also linearly increases the total memory and computational resources required to train the network. Therefore, there is a trade-off between the performance gain obtained by model selection and the physically available resources.

4.2 Self-Supervised Memory Loss Evaluation

Figure 4: Mean training curves for different refreshing probability values on (a) the copy task and (b) the associative recall task. The shadowed area shows the standard deviation over 10 trials.

We evaluate the effect of SML on the copy and associative recall tasks from graves2014neural. The copy task is designed to show whether a model can store and recall arbitrarily long sequential data correctly, and the associative recall task is intended to show whether a model can recall the information associated with a given cue by remembering the temporal relations between input items. Figure 4 shows the mean training curves with respect to different values of the refreshing probability, p, on the copy task and the associative recall task, respectively. For DMSDNC, although we show only DMSDNC-3, the other configurations (DMSDNC-2, DMSDNC-4) show similar results. As shown in the figures, the SML function expedites the learning of the models in most cases. For the original DNC, the effect of SML is clear: it makes training much faster, and the speed-up grows with higher refreshing probability on both tasks. For DMSDNC, the SML clearly increases the training speed of the models, and such distributed memory models are not very sensitive to changes in the refreshing probability. From those results, we conclude that the effect of SML is closely related to the properties of a given task. For the bAbI task, compared to the copy and associative recall tasks, memorizing more story input than a certain threshold (higher refreshing probability) does not provide much benefit to the final model performance.

Model                            Mean           Best
DNC [graves2016hybrid]           16.7 ± 7.6     3.8
SDNC [rae2016scaling]            6.4 ± 2.5      2.9
rsDNC [franke2018robust]         6.3 ± 2.7      3.6
DNC-MD [csordas2019improving]    9.5 ± 1.6      n/a
NUTM [Le2020Neural]              5.6 ± 1.9      3.3
DMSDNC-2 (0.1)                   1.53 ± 1.33    0.16
DMSDNC-2 (0.3)                   2.30 ± 1.08    0.14

Table 1: Word error rate [%] (mean ± std, and best run) of DNC-based models trained jointly on all 20 bAbI tasks.
Model                              Mean            Best
MNM [munkhdalai2019metalearned]    0.208 ± 0.033   0.175
MEMO [Banino2020MEMO]              0.86 ± 1.11     0.21
DMSDNC-2 (0.1)                     0.202 ± 0.032   0.16

Table 2: Word error rate [%] for the top 5 runs of MANN models trained jointly on all 20 bAbI tasks.

4.3 Distributed Memory based Self-Supervised DNC

As shown in the ablation study, each component of the proposed architecture has a significant impact on the original DNC's performance. To show the performance of the whole combined model, DMSDNC, we compare our architecture to other DNC variants on the bAbI task. Table 1 shows the experimental results. Our proposed model, DMSDNC-2 with refreshing probability 0.1, shows the best mean performance on the bAbI task among all DNC-based approaches. These results demonstrate that our proposed architecture efficiently learns the bAbI task by using multiple distributed memory blocks and a memory-oriented loss. Furthermore, as shown in Table 2, the best run of DMSDNC-2 achieves state-of-the-art performance on the bAbI task, even compared to recent MANN models that are not based on DNC. DMSDNC-2 is the optimal configuration for the bAbI tasks when we use the same total memory size as DNC and divide the memory into smaller sub-blocks for efficient memory utilization.

4.4 Pixel Sequence based Image Recognition

Figure 5: Test accuracy on MNIST and pMNIST.

This task is a pixel-by-pixel image classification task [lamb2016professor, le2015simple] on MNIST [lecun1998gradient], in which the pixels of each image are fed into a memory network in scan order or in a permuted order. The permuted pixel sequence has higher job complexity because the neighborhood information of a pixel is lost by the permutation. Figure 5 compares the test accuracy of DNC and our DMDNC models. The results show that our DMDNC models train faster than the conventional DNC, and this can be further enhanced by increasing the degree of distribution (from DMDNC-2 to DMDNC-3).
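For reference, converting MNIST images into pixel sequences, with an optional fixed permutation for the pMNIST variant, can be sketched as follows; the exact preprocessing used in the paper may differ.

    import torch

    def to_pixel_sequence(images, permutation=None):
        # images: (B, 1, 28, 28) MNIST batch -> (B, 784, 1) pixel-by-pixel input sequence
        seq = images.view(images.size(0), -1, 1)
        if permutation is not None:           # pMNIST: one fixed random pixel ordering
            seq = seq[:, permutation, :]
        return seq

    perm = torch.randperm(28 * 28)            # shared permutation for the whole dataset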

5 Conclusion and Future Work

In this paper, we presented a novel distributed memory architecture and an SML function, combined in DMSDNC, to enhance memory augmented neural networks. The proposed distributed memory model stores input contents in multiple memory blocks with diverse representations and retrieves the required information with soft-attention-based interpolation over the multiple distributed memories. We also introduced a novel loss function, the self-supervised memory loss, to explicitly improve the memorization performance of DNC; it is designed to update the contents of the DNC memory with sampled input data in a self-supervised way. The evaluation results demonstrate that DMSDNC correctly stores input information and robustly recalls the stored information according to the purpose of a given task. We also showed that the SML function improves the learning speed of a memory augmented neural network by self-refreshing the contents of memory. Overall, DMSDNC significantly outperforms all other variations of DNC and achieves state-of-the-art performance on the QA tasks, even compared to other meta-learning based memory augmented neural network models.

In future work, we will investigate the relation between DM and SML to find an optimal configuration of the model for each application domain. Currently, DM does not take full advantage of SML because of its multiple-memory architecture; we believe SML can be further enhanced to effectively refresh multiple memories. Furthermore, we expect that DM with SML can provide a powerful and generally applicable strategy for enhancing memory network performance.

References