Hierarchical Memory Networks

by   Sarath Chandar, et al.

Memory networks are neural networks with an explicit memory component that can be both read and written to by the network. The memory is often addressed in a soft way using a softmax function, making end-to-end training with backpropagation possible. However, this is not computationally scalable for applications which require the network to read from extremely large memories. On the other hand, it is well known that hard attention mechanisms based on reinforcement learning are challenging to train successfully. In this paper, we explore a form of hierarchical memory network, which can be considered as a hybrid between hard and soft attention memory networks. The memory is organized in a hierarchical structure such that reading from it is done with less computation than soft attention over a flat memory, while also being easier to train than hard attention over a flat memory. Specifically, we propose to incorporate Maximum Inner Product Search (MIPS) in the training and inference procedures for our hierarchical memory network. We explore the use of various state-of-the art approximate MIPS techniques and report results on SimpleQuestions, a challenging large scale factoid question answering task.


page 1

page 2

page 3

page 4


Question Generation from Paragraphs: A Tale of Two Hierarchical Models

Automatic question generation from paragraphs is an important and challe...

Metalearned Neural Memory

We augment recurrent neural networks with an external memory mechanism t...

Learning Visual Question Answering by Bootstrapping Hard Attention

Attention mechanisms in biological perception are thought to select subs...

Out-of-core Training for Extremely Large-Scale Neural Networks With Adaptive Window-Based Scheduling

While large neural networks demonstrate higher performance in various ta...

Why can't memory networks read effectively?

Memory networks have been a popular choice among neural architectures fo...

End-To-End Memory Networks

We introduce a neural network with a recurrent attention model over a po...

Dog Identification using Soft Biometrics and Neural Networks

This paper addresses the problem of biometric identification of animals,...

Code Repositories


This is the code&dataset for the paper [Hierarchical Memory Networks for Answer Selection on Unknown Words. COLING 2016]

view repo


This project collects the different accepted papers and their link to Arxiv or Gitxiv

view repo

1 Introduction

Until recently, traditional machine learning approaches for challenging tasks such as image captioning, object detection, or machine translation have consisted in complex pipelines of algorithms, each being separately tuned for better performance. With the recent success of neural networks and deep learning research, it has now become possible to train a single model end-to-end, using backpropagation. Such end-to-end systems often outperform traditional approaches, since the entire model is directly optimized with respect to the final task at hand. However, simple encode-decode style neural networks often underperform on knowledge-based reasoning tasks like question-answering or dialog systems. Indeed, in such cases it is nearly impossible for regular neural networks to store all the necessary knowledge in their parameters.

Neural networks with memory Graves et al. (2014); Weston et al. (2015a)

can deal with knowledge bases by having an external memory component which can be used to explicitly store knowledge. The memory is accessed by reader and writer functions, which are both made differentiable so that the entire architecture (neural network, reader, writer and memory components) can be trained end-to-end using backpropagation. Memory-based architectures can also be considered as generalizations of RNNs and LSTMs, where the memory is analogous to recurrent hidden states. However they are much richer in structure and can handle very long-term dependencies because once a vector (i.e., a memory) is stored, it is copied from time step to time step and can thus stay there for a very long time (and gradients correspondingly flow back time unhampered).

There exists several variants of neural networks with a memory component: Memory Networks Weston et al. (2015a)

, Neural Turing Machines (NTM) 

Graves et al. (2014), Dynamic Memory Networks (DMN) Kumar et al. (2015). They all share five major components: memory, input module, reader, writer, and output module.

Memory: The memory is an array of cells, each capable of storing a vector. The memory is often initialized with external data (e.g. a database of facts), by filling in its cells with a pre-trained vector representations of that data.

Input module: The input module is to compute a representation of the input that can be used by other modules.

Writer: The writer takes the input representation and updates the memory based on it. The writer can be as simple as filling the slots in the memory with input vectors in a sequential way (as often done in memory networks). If the memory is bounded, instead of sequential writing, the writer has to decide where to write and when to rewrite cells (as often done in NTMs).

Reader: Given an input and the current state of the memory, the reader retrieves content from the memory, which will then be used by an output module. This often requires comparing the input’s representation or a function of the recurrent state with memory cells using some scoring function such as a dot product.

Output module: Given the content retrieved by the reader, the output module generates a prediction, which often takes the form of a conditional distribution over multiple labels for the output.

For the rest of the paper, we will use the name memory network to describe any model which has any form of these five components. We would like to highlight that all the components except the memory are learnable. Depending on the application, any of these components can also be fixed. In this paper, we will focus on the situation where a network does not write and only reads from the memory.

In this paper, we focus on the application of memory networks to large-scale tasks. Specifically, we focus on large scale factoid question answering. For this problem, given a large set of facts and a natural language question, the goal of the system is to answer the question by retrieving the supporting fact for that question, from which the answer can be derived. Application of memory networks to this task has been studied in Bordes et al. (2015). However, Bordes et al. (2015)

depended on keyword based heuristics to filter the facts to a smaller set which is manageable for training. However heuristics are invariably dataset dependent and we are interested in a more general solution which can be used when the facts are of any structure. One can design soft attention retrieval mechanisms, where a convex combination of all the cells is retrieved or design hard attention retrieval mechanisms where one or few cells from the memory are retrieved. Soft attention is achieved by using softmax over the memory which makes the reader differentiable and hence learning can be done using gradient descent. Hard attention is achieved by using methods like REINFORCE 

Williams (1992)

, which provides a noisy gradient estimate when discrete stochastic decisions are made by a model.

Both soft attention and hard attention have limitations. As the size of the memory grows, soft attention using softmax weighting is not scalable. It is computationally very expensive, since its complexity is linear in the size of the memory. Also, at initialization, gradients are dispersed so much that it can reduce the effectiveness of gradient descent. These problems can be alleviated by a hard attention mechanism, for which the training method of choice is REINFORCE. However, REINFORCE can be brittle due to its high variance and existing variance reduction techniques are complex. Thus, it is rarely used in memory networks (even in cases of a small memory).

In this paper, we propose a new memory selection mechanism based on Maximum Inner Product Search (MIPS) which is both scalable and easy to train. This can be considered as a hybrid of soft and hard attention mechanisms. The key idea is to structure the memory in a hierarchical way such that it is easy to perform MIPS, hence the name Hierarchical Memory Network (HMN). HMNs are scalable at both training and inference time. The main contributions of the paper are as follows:

  • We explore hierarchical memory networks, where the memory is organized in a hierarchical fashion, which allows the reader to efficiently access only a subset of the memory.

  • While there are several ways to decide which subset to access, we propose to pose memory access as a maximum inner product search (MIPS) problem.

  • We empirically show that exact MIPS-based algorithms not only enjoy similar convergence as soft attention models, but can even improve the performance of the memory network.

  • Since exact MIPS is as computationally expensive as a full soft attention model, we propose to train the memory networks using approximate MIPS techniques for scalable memory access.

  • We empirically show that unlike exact MIPS, approximate MIPS algorithms provide a speedup and scalability of training, though at the cost of some performance.

2 Hierarchical Memory Networks

In this section, we describe the proposed Hierarchical Memory Network (HMN). In this paper, HMNs only differ from regular memory networks in two of its components: the memory and the reader.

Memory: Instead of a flat array of cells for the memory structure, HMNs leverages a hierarchical memory structure. Memory cells are organized into groups and the groups can further be organized into higher level groups. The choice for the memory structure is tightly coupled with the choice of reader, which is essential for fast memory access. We consider three classes of approaches for the memory’s structure: hashing-based approaches, tree-based approaches, and clustering-based approaches. This is explained in detail in the next section.

Reader: The reader in the HMN is different from the readers in flat memory networks. Flat memory-based readers use either soft attention over the entire memory or hard attention that retrieves a single cell. While these mechanisms might work with small memories, with HMNs we are more interested in achieving scalability towards very large memories. So instead, HMN readers use soft attention only over a selected subset of the memory. Selecting memory subsets is guided by a maximum inner product search algorithm, which can exploit the hierarchical structure of the organized memory to retrieve the most relevant facts in sub-linear time. The MIPS-based reader is explained in more detail in the next section.

In HMNs, the reader is thus trained to create MIPS queries such that it can retrieve a sufficient set of facts. While most of the standard applications of MIPS Ram and Gray (2012); Bachrach et al. (2014); Shrivastava and Li (2014) so far have focused on settings where both query vector and database (memory) vectors are precomputed and fixed, memory readers in HMNs are learning to do MIPS by updating the input representation such that the result of MIPS retrieval contains the correct fact(s).

3 Memory Reader with -MIPS attention

In this section, we describe how the HMN memory reader uses Maximum Inner Product Search (MIPS) during learning and inference.

We begin with a formal definition of -MIPS. Given a set of points and a query vector , our goal is to find


where the returns the indices of the top- maximum values. In the case of HMNs, corresponds to the memory and corresponds to the vector computed by the input module.

A simple but inefficient solution for -MIPS involves a linear search over the cells in memory by performing the dot product of with all the memory cells. While this will return the exact result for -MIPS, it is too costly to perform when we deal with a large-scale memory. However, in many practical applications, it is often sufficient to have an approximate result for -MIPS, trading speed-up at the cost of the accuracy. There exist several approximate -MIPS solutions in the literature Shrivastava and Li (2014, 2015); Bachrach et al. (2014); Neyshabur and Srebro (2015).

All the approximate -MIPS solutions add a form of hierarchical structure to the memory and visit only a subset of the memory cells to find the maximum inner product for a given query. Hashing-based approaches Shrivastava and Li (2014, 2015); Neyshabur and Srebro (2015) hash cells into multiple bins, and given a query they search for -MIPS cell vectors only in bins that are close to the bin associated with the query. Tree-based approaches Ram and Gray (2012); Bachrach et al. (2014) create search trees with cells in the leaves of the tree. Given a query, a path in the tree is followed and MIPS is performed only for the leaf for the chosen path. Clustering-based approaches Auvolat et al. (2015) cluster cells into multiple clusters (or a hierarchy of clusters) and given a query, they perform MIPS on the centroids of the top few clusters. We refer the readers to Auvolat et al. (2015) for an extensive comparison of various state-of-the-art approaches for approximate -MIPS.

Our proposal is to exploit this rich approximate -MIPS literature to achieve scalable training and inference in HMNs. Instead of filtering the memory with heuristics, we propose to organize the memory based on approximate -MIPS algorithms and then train the reader to learn to perform MIPS. Specifically, consider the following softmax over the memory which the reader has to perform for every reading step to retrieve a set of relevant candidates:


where is the representation of the query, is the memory with being the total number of cells in the memory. We propose to replace this softmax with which is defined as follows:


where is the indices of top- MIP candidate cells and is a sub-matrix of where the rows are indexed by .

One advantage of using the is that it naturally focuses on cells that would normally receive the strongest gradients during learning. That is, in a full softmax, the gradients are otherwise more dispersed across cells, given the large number of cells and despite many contributing a small gradient. As our experiments will show, this results in slower training.

One problematic situation when learning with the is when we are at the initial stages of training and the -MIPS reader is not including the correct fact candidate. To avoid this issue, we always include the correct candidate to the top- candidates retrieved by the -MIPS algorithm, effectively performing a fully supervised form of learning.

During training, the reader is updated by backpropagation from the output module, through the subset of memory cells. Additionally, the log-likelihood of the correct fact computed using -softmax is also maximized. This second supervision helps the reader learn to modify the query such that the maximum inner product of the query with respect to the memory will yield the correct supporting fact in the top candidate set.

Until now, we described the exact -MIPS-based learning framework, which still requires a linear look-up over all memory cells and would be prohibitive for large-scale memories. In such scenarios, we can replace the exact -MIPS in the training procedure with the approximate -MIPS. This is achieved by deploying a suitable memory hierarchical structure. The same approximate -MIPS-based reader can be used during inference stage as well. Of course, approximate -MIPS algorithms might not return the exact MIPS candidates and will likely to hurt performance, but at the benefit of achieving scalability.

While the memory representation is fixed in this paper, updating the memory along with the query representation should improve the likelihood of choosing the correct fact. However, updating the memory will reduce the precision of the approximate -MIPS algorithms, since all of them assume that the vectors in the memory are static. Designing efficient dynamic -MIPS should improve the performance of HMNs even further, a challenge that we hope to address in future work.

3.1 Reader with Clustering-based approximate -Mips

Clustering-based approximate -MIPS was proposed in Auvolat et al. (2015) and it has been shown to outperform various other state-of-the-art data dependent and data independent approximate -MIPS approaches for inference tasks. As we will show in the experiments section, clustering-based MIPS also performs better when used to training HMNs. Hence, we focus our presentation on the clustering-based approach and propose changes that were found to be helpful for learning HMNs.

Following most of the other approximate -MIPS algorithms, Auvolat et al. (2015)

converts MIPS to Maximum Cosine Similarity Search (MCSS) problem:


When all the data vectors have the same norm, then MCSS is equivalent to MIPS. However, it is often restrictive to have this additional constraint. Instead, Auvolat et al. (2015) appends additional dimensions to both query and data vectors to convert MIPS to MCSS. In HMN terminology, this would correspond to adding a few more dimensions to the memory cells and input representations.

The algorithm introduces two hyper-parameters, and . The first step is to scale all the vectors in the memory by the same factor, such that . We then apply two mappings, and , on the memory cells and on the input vector, respectively. These two mappings simply concatenate new components to the vectors and make the norms of the data points all roughly the same Shrivastava and Li (2015). The mappings are defined as follows:


We thus have the following approximation of MIPS by MCSS for any query vector :


Once we convert MIPS to MCSS, we can use spherical -means Zhong (2005) or its hierarchical version to approximate and speedup the cosine similarity search. Once the memory is clustered, then every read operation requires only dot-products, where is the number of cluster centroids.

Since this is an approximation, it is error-prone. As we are using this approximation for the learning process, this introduces some bias in gradients, which can affect the overall performance of HMN. To alleviate this bias, we propose three simple strategies.

  • Instead of using only the top- candidates for a single read query, we also add top- candidates retrieved for every other read query in the mini-batch. This serves two purposes. First, we can do efficient matrix multiplications by leveraging GPUs since all the -softmax in a minibatch are over the same set of elements. Second, this also helps to decrease the bias introduced by the approximation error.

  • For every read access, instead of only using the top few clusters which has a maximum product with the read query, we also sample some clusters from the rest, based on a probability distribution log-proportional to the dot product with the cluster centroids. This also decreases the bias.

  • We can also sample random blocks of memory and add it to top- candidates.

We empirically investigate the effect of these variations in Section 5.5.

4 Related Work

Memory networks have been introduced in Weston et al. (2015a) and have been so far applied to comprehension-based question answering Weston et al. (2015b); Sukhbaatar et al. (2015), large scale question answering Bordes et al. (2015) and dialogue systems Dodge et al. (2015). While Weston et al. (2015a) considered supervised memory networks in which the correct supporting fact is given during the training stage, Sukhbaatar et al. (2015) introduced semi-supervised memory networks that can learn the supporting fact by itself. Kumar et al. (2015); Xiong et al. (2016)

introduced Dynamic Memory Networks (DMNs) which can be considered as a memory network with two types of memory: a regular large memory and an episodic memory. Another related class of model is the Neural Turing Machine 

Graves et al. (2014), which is uses softmax-based soft attention. Later Zaremba and Sutskever (2015) extended NTM to hard attention using reinforcement learning. Dodge et al. (2015); Bordes et al. (2015) alleviate the problem of the scalability of soft attention by having an initial keyword based filtering stage, which reduces the number of facts being considered. Our work generalizes this filtering by using MIPS for filtering. This is desirable because MIPS can be applied for any modality of data or even when there is no overlap between the words in a question and the words in facts.

The softmax arises in various situations and most relevant to this work are scaling methods for large vocabulary neural language modeling. In neural language modeling, the final layer is a softmax distribution over the next word and there exist several approaches to achieve scalability. Morin and Bengio (2005) proposes a hierarchical softmax based on prior clustering of the words into a binary, or more generally -ary tree, that serves as a fixed structure for the learning process of the model. The complexity of training is reduced from to

. Due to its clustering and tree structure, it resembles the clustering-based MIPS techniques we explore in this paper. However, the approaches differ at a fundamental level. Hierarchical softmax defines the probability of a leaf node as the product of all the probabilities computed by all the intermediate softmaxes on the way to that leaf node. By contrast, an approximate MIPS search imposes no such constraining structure on the probabilistic model, and is better thought as efficiently searching for top winners of what amounts to be a large ordinary flat softmax. Other methods such as Noice Constrastive Estimation 

Mnih and Gregor (2014) and Negative Sampling Mikolov et al. (2013) avoid an expensive normalization constant by sampling negative samples from some marginal distribution. By contrast, our approach approximates the softmax by explicitly including in its negative samples candidates that likely would have a large softmax value. Jean et al. (2015) introduces an importance sampling approach that considers all the words in a mini-batch as the candidate set. This in general might also not include the MIPS candidates with highest softmax values.

Spring and Shrivastava (2016) is the only work that we know of, proposing to use MIPS during learning. It proposes hashing-based MIPS to sort the hidden layer activations and reduce the computation in every layer. However, a small scale application was considered and data-independent methods like hashing will likely suffer as dimensionality increases.

5 Experiments

In this section, we report experiments on factoid question answering using hierarchical memory networks. Specifically, we use the SimpleQuestions dataset Bordes et al. (2015). The aim of these experiments is not to achieve state-of-the-art results on this dataset. Rather, we aim to propose and analyze various approaches to make memory networks more scalable and explore the achieved tradeoffs between speed and accuracy.

5.1 Dataset

We use SimpleQuestions Bordes et al. (2015) which is a large scale factoid question answering dataset. SimpleQuestions consists of 108,442 natural language questions, each paired with a corresponding fact from Freebase. Each fact is a triple (subject,relation,object) and the answer to the question is always the object. The dataset is divided into training (75910), validation (10845), and test (21687) sets. Unlike Bordes et al. (2015) who additionally considered FB2M (10M facts) or FB5M (12M facts) with keyword-based heuristics for filtering most of the facts for each question, we only use SimpleQuestions, with no keyword-based heuristics. This allows us to do a direct comparison with the full softmax approach in a reasonable amount of time. Moreover, we would like to highlight that for this dataset, keyword-based filtering is a very efficient heuristic since all questions have an appropriate source entity with a matching word. Nevertheless, our goal is to design a general purpose architecture without such strong assumptions on the nature of the data.

5.2 Model

Let be the vocabulary of all words in the natural language questions. Let be a matrix where each row is some dimensional embedding for a word in the question vocabulary. This matrix is initialized with random values and learned during training. Given any question, we represent it with a bag-of-words representation by summing the vector representation of each word in the question. Let ,

Then, to find the relevant fact from the memory M, we call the -MIPS-based reader module with as the query. This uses Equation 3 and 4 to compute the output of the reader . The reader is trained by minimizing the Negative Log Likelihood (NLL) of the correct fact.

where is the index of the correct fact in . We are fixing the memory embeddings to the TransE Bordes et al. (2013) embeddings and learning only the question embeddings.

This model is simpler than the one reported in Bordes et al. (2015) so that it is esay to analyze the effect of various memory reading strategies.

5.3 Training Details

We trained the model with the Adam optimizer Kingma and Ba (2014), with a fixed learning rate of 0.001. We used mini-batches of size 128. We used 200 dimensional embeddings for the TransE entities, yielding 600 dimensional embeddings for facts by concatenating the embeddings of the subject, relation and object. We also experimented with summing the entities in the triple instead of concatenating, but we found that it was difficult for the model to differentiate facts this way. The only learnable parameters by the HMN model are the question word embeddings. The entity distribution in SimpleQuestions is extremely sparse and hence, following Bordes et al. (2015), we also add artificial questions for all the facts for which we do not have natural language questions. Unlike Bordes et al. (2015)

, we do not add any other additional tasks like paraphrase detection to the model, mainly to study the effect of the reader. We stopped training for all the models when the validation accuracy consistently decreased for 3 epochs.

5.4 Exact -MIPS improves accuracy

In this section, we compare the performance of the full soft attention reader and exact -MIPS attention readers. Our goal is to verify that -MIPS attention is in fact a valid and useful attention mechanism and see how it fares when compared to full soft attention. For -MIPS attention, we tried . We would like to emphasize that, at training time, along with candidates for a particular question, we also add the

-candidates for each question in the mini-batch. So the exact size of the softmax layer would be higer than

during training. In Table 2, we report the test performance of memory networks using the soft attention reader and -MIPS attention reader. We also report the average softmax size during training. From the table, it is clear that the -MIPS attention readers improve the performance of the network compared to soft attention reader. In fact, smaller the value of is, better the performance. This result suggests that it is better to use a -MIPS layer instead of softmax layer whenever possible. It is interesting to see that the convergence of the model is not slowed down due to this change in softmax computation (as shown in Figure 2).

Model Test Acc. Avg. Softmax Size
Full-softmax 59.5 108442
10-MIPS 62.2 1290
50-MIPS 61.2 6180
100-MIPS 60.6 11928
1000-MIPS 59.6 70941
Clustering 51.5 20006
PCA-Tree 32.4 21108
WTA-Hash 40.2 20008
Figure 1: Accuracy in SQ test-set and average size of memory used. 10-softmax has high performance while using only smaller amount of memory.
Figure 2: Validation curve for various models. Convergence is not slowed down by k-softmax.

This experiment confirms the usefulness of -MIPS attention. However, exact -MIPS has the same complexity as a full softmax. Hence, to scale up the training, we need more efficient forms of -MIPS attention, which is the focus of next experiment.

5.5 Approximate -MIPS based learning

As mentioned previously, designing faster algorithms for -MIPS is an active area of research. Auvolat et al. (2015) compared several state-of-the-art data-dependent and data-independent methods for faster approximate -MIPS and it was found that clustering-based MIPS performs significantly better than other approaches. However the focus of the comparison was on performance during the inference stage. In HMNs, -MIPS must be used at both training stage and inference stages. To verify if the same trend can been seen during learning stage as well, we compared three different approaches:

Clustering: This was explained in detail in section 3.

WTA-Hash: Winner Takes All hashing Vijayanarasimhan et al. (2014) is a hashing-based -MIPS algorithm which also converts MIPS to MCSS by augmenting additional dimensions to the vectors. This method used hash functions and each hash function does different random permutations of the vector. Then the prefix constituted by the first elements of each permuted vector is used to construct the hash for the vector.

PCA-Tree: PCA-Tree Bachrach et al. (2014) is the state-of-the-art tree-based method, which converts MIPS to NNS by vector augmentation. It uses the principal components of the data to construct a balanced binary tree with data residing in the leaves.

For a fair comparison, we varied the hyper-parameters of each algorithm in such a way that the average speedup is approximately the same. Table 2 shows the performance of all three methods, compared to a full softmax. From the table, it is clear that the clustering-based method performs significantly better than the other two methods. However, performances are lower when compared to the performance of the full softmax.

As a next experiment, we analyze various the strategies proposed in Section 3.1 to reduce the approximation bias of clustering-based -MIPS:

Top-K: This strategy picks the vectors in the top clusters as candidates.

Sample-K: This strategy samples clusters, without replacement, based on a probability distribution based on the dot product of the query with the cluster centroids. When combined with the Top- strategy, we ignore clusters selected by the Top- strategy for sampling.

Rand-block: This strategy divides the memory into several blocks and uniformly samples a random block as candidate.

We experimented with 1000 clusters and 2000 clusters. While comparing various training strategies, we made sure that the effective speedup is approximately the same. Memory access to facts per query for all the models is approximately 20,000, hence yielding a 5X speedup.

1000 clusters 2000 clusters
Top-K Sample-K rand-block Test Acc. epochs Test Acc. epochs
Yes No No 50.2 16 51.5 22
No Yes No 52.5 68 52.8 63
Yes Yes No 52.8 31 53.1 26
Yes No Yes 51.8 32 52.3 26
Yes Yes Yes 52.5 38 52.7 19

Accuracy in SQ test set and number of epochs for convergence.

Results are given in Table 5.5. We observe that the best approach is to combine the Top-K and Sample-K strategies, with Rand-block not being beneficial. Interestingly, the worst performances correspond to cases where the Sample-K strategy is ignored.

6 Conclusion

In this paper, we proposed a hierarchical memory network that exploits -MIPS for its attention-based reader. Unlike soft attention readers, -MIPS attention reader is easily scalable to larger memories. This is achieved by organizing the memory in a hierarchical way. Experiments on the SimpleQuestions dataset demonstrate that exact -MIPS attention is better than soft attention. However, existing state-of-the-art approximate -MIPS techniques provide a speedup at the cost of some accuracy. Future research will investigate designing efficient dynamic -MIPS algorithms, where the memory can be dynamically updated during training. This should reduce the approximation bias and hence improve the overall performance.