Labeled Memory Networks for Online Model Adaptation

Shiv Shankar et al. (IIT Bombay), 07/05/2017

Augmenting a neural network with memory that can grow without growing the number of trained parameters is a recent powerful concept with many exciting applications. We propose a design of memory augmented neural networks (MANNs) called Labeled Memory Networks (LMNs) suited for tasks requiring online adaptation in classification models. LMNs organize the memory with classes as the primary key. The memory acts as a second boosted stage following a regular neural network, allowing the memory and the primary network to play complementary roles. Unlike existing MANNs that write to memory for every instance and use LRU-based memory replacement, LMNs write only for instances with non-zero loss and use label-based memory replacement. We demonstrate significant accuracy gains on various tasks including language modelling and few-shot learning. In this paper, we establish their potential for online adaptation of a batch-trained neural network to domain-relevant labeled data at deployment time. We show that LMNs are better than other MANNs designed for meta-learning. We also find them to be more accurate and faster than state-of-the-art methods that retune model parameters to adapt to domain-specific labeled data.


Introduction

While deep learning models achieve impressive accuracy in supervised learning tasks such as computer vision [Krizhevsky, Sutskever, and Hinton2012], translation [Wu2016], and speech recognition [Yu and Deng2014], training such models places significant demands on the amount of labeled data, computational resources, and manual tuning effort. Training is thus considered an infrequently performed, resource-intensive batch process. Many applications, however, require trained models to quickly adapt to new examples and settings during deployment. Consider the online sequence prediction task that arises in applications such as text auto-completion, user trajectory prediction, or next-URL prediction. Even when the prediction model has been trained on many sequences, next-token predictions on a new sequence can benefit from the true tokens observed online. Another example is image recognition, where a trained model should be able to recognize new objects not seen during batch training from only a few examples. This need has led to a surge of interest in few-shot learning [Kaiser et al.2017, Santoro et al.2016, Vinyals et al.2016, Snell, Swersky, and Zemel2017].

Few-shot learning, however, is an artificially simplified setting where a small set of new labels defines independent classification tasks. We consider the more useful but more difficult and less explored task where an existing image classifier has to be extended to handle new labels.

An established technique for model adaptation is to retune part or all of the model parameters using the domain-labeled data in a separate adaptation phase [Daumé III]. More recently, deep meta-networks have been proposed that "learn to learn" such adaptation [Rei2015, Finn, Abbeel, and Levine2017, Huang et al.2015, Ravi and Larochelle2017]. Parameter-retuning methods typically assume a one-time adaptation step, and operating them in online settings is slow. Moreover, with a pre-trained network, the multiple gradient updates of meta-learning tend to destroy useful information already learned by the network, a phenomenon closely related to 'catastrophic forgetting' [French1999, Kirkpatrick et al.2016].

An alternative to parameter retuning is memorizing. In this approach neural networks are augmented with memory that can grow without correspondingly increasing the number of parameters to be trained. Many exciting uses have been found for such MANNs, including program learning [Graves, Wayne, and Danihelka2014], question answering [Weston2016, Gülçehre et al.2016], learning rare events [Kaiser et al.2017], and meta learning [Santoro et al.2016]. They hold promise for model adaptation because memory can cut short the conventional path of iterative training needed to percolate new facts into model parameters.

However, our initial attempts at using existing MANNs like NTMs [Graves, Wayne, and Danihelka2014] and DNTMs [Gülçehre et al.2016] for online sequence prediction did not improve over the baseline model. One major challenge is correctly balancing the roles of the memory and the batch-trained model. On the one hand we have a shared model trained over several instances, and on the other hand we have the few but more relevant instances encountered during testing. Recent MANNs designed for meta-learning [Santoro et al.2016] partition these roles by using the shared model to learn an embedding and the memory to implement a nearest-neighbor-like classifier. While this architecture works for few-shot learning where all labels are new, it cannot adapt classification models with softmax layers over a large shared label space.

Contributions

In this paper we present a new MANN called Labeled Memory Network (LMN) that provides an easy, fast, and plug-and-play solution for adapting pre-trained models. We highlight three design decisions that made LMNs suitable for such adaptation.

First, we apply ideas from boosting [Schapire1999] and treat the memory as a second-stage classifier that is updated only when the current loss is non-zero. Existing MANNs write to memory on every pass: even when the goal of the model is to use the memory to remember rare events [Kaiser et al.2017], the memory stores all events, not just the rare ones. This wastes a lot of memory on storing non-rare vectors, often displacing the rare ones.

Second, we propose a 'labeled memory' where the primary means of addressing a memory cell is a class label. This is in contrast to all existing MANNs that use a controller to generate a hidden vector of the input as the key. (Our method of using labels for addressing is very different from the discrete addressing mechanism proposed in [Gülçehre et al.2016]: even though both result in discrete addressing, in our case the key is the label, whereas in [Gülçehre et al.2016] the key is still computed from the input.) This simple, yet surprisingly unexplored, alternative organization has two advantages. First, the controller is freed from the vaguely supervised task of generating keys that are distinctive and relevant, leading to better memory use. Second, it safeguards against catastrophic forgetting of rare labels, to which current LRU-based MANNs are subject.

Third, we use the power of an RNN to adaptively define the roles of the memory and the neural network in a label-dependent manner. This is unlike traditional boosting, where the stage weights are fixed and derived from simple functions of the error of each stage.

LMNs can be used for online adaptation in a variety of settings. We compare not only with existing MANNs but also with state-of-the-art meta-learning methods that retune parameters [Rei2015, Finn, Abbeel, and Levine2017]. On online sequence prediction tasks spanning five applications, including user trajectory prediction and next-click prediction, we show significant gains. Second, we report higher accuracy in two different settings of the popular Omniglot image classification task. Finally, we present results from two popular language modeling benchmarks and report improvements over the state of the art.

Online Model Adaptation using Labeled Memory Networks (LMNs)

Problem Description

Online model adaptation kicks in at the time of deploying a batch-trained model. Our focus in this paper is classification models. During deployment the model sees inputs one at a time in sequence. At time t, for the input x_t we predict the label ŷ_t, after which the true label y_t is revealed. The online model adapter decides how to improve the prediction for the next input by combining the limited labeled data with the batch-trained model. The adaptation has to be fast and performed synchronously at each step. We are agnostic about how the classifier is batch trained. However, we assume that the adapter can be trained using several such sequences representative of the deployment scenario. Many problems such as user trajectory prediction, language modeling, and few-shot learning can be cast in this formulation.
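
To make the protocol concrete, the following minimal Python sketch shows the online loop assumed above: predict, observe the true label, then adapt before the next input. The `model.predict` and `model.adapt` helpers are hypothetical placeholders, not part of any released LMN code.

```python
def online_eval(model, sequence):
    """Run one deployment sequence: predict each label, then reveal it.

    `sequence` is an iterable of (x_t, y_t) pairs; `model` is any object
    exposing hypothetical predict/adapt methods."""
    correct = 0
    for x_t, y_t in sequence:
        y_hat = model.predict(x_t)   # prediction made before y_t is known
        correct += int(y_hat == y_t)
        model.adapt(x_t, y_t)        # cheap, synchronous update per step
    return correct / max(len(sequence), 1)
```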

We describe our method of online adaptation in two phases. We first explain how LMNs make predictions for an input x_t, and then how they adapt when the true label y_t is revealed. LMNs comprise three components: the primary classification network (PCN), the memory module, and a combiner network. The architecture is depicted graphically in Figure 1(a). We describe each component next.

Primary Classification Network (PCN)

This refers to the batch-trained neural network that we seek to adapt. The PCN may be stateful or stateless. For example, in applications like trajectory prediction the sequence of inputs is stateful, whereas in tasks like image classification the inputs are stateless. Our only assumption is that the last layer of the PCN transforms each input x_t into a real vector h_t before feeding it to a softmax layer to predict a distribution over the class labels. We use θ_y to denote the softmax parameter for class y, so the score for predicting class y for input x_t is

s_t^pcn(y) = θ_y · h_t    (1)
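
As a rough numpy illustration of Equation 1, under the standard reading that the PCN score is the inner product between the last-layer embedding and the per-class softmax parameters (this is a sketch, not an extract from the authors' code):

```python
import numpy as np

def pcn_scores(h_t, theta):
    """h_t: last-layer embedding of shape (d,); theta: softmax parameters (L, d).
    Returns the unnormalized score theta_y . h_t for every class y."""
    return theta @ h_t

def pcn_probs(h_t, theta):
    """Softmax over the per-class scores."""
    s = pcn_scores(h_t, theta)
    s = s - s.max()              # subtract max for numerical stability
    e = np.exp(s)
    return e / e.sum()
```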

Memory

The memory consists of N cells. Each cell i is a 3-tuple (ℓ_i, m_i, w_i), with ℓ_i denoting the label of cell i, m_i the hidden vector stored in it, and w_i a weight attached to the cell. This storage format is similar to what is used in existing MANNs, but the way in which we read, use, and update the memory is very different. In existing MANNs the memory is viewed as part of the main network: values are read based on matching a key against the stored vectors, and the read values are processed by the main network to produce the output. In contrast, we view the memory as a learner, specifically an online learner, that is only loosely integrated with the PCN. We describe our alternative design here.

We index the memory by class label so that, given a label y, we can enumerate all cells with label y. The memory provides a score over each class label for an input x_t. We use the PCN last-layer output h_t as an embedding of x_t, on the basis of which we compute a kernel K(h_t, m_i) between h_t and a memory vector m_i. Given h_t and y we read a vector v_t^y along with a scalar weight w_t^y calculated as

α_i = exp(K(h_t, m_i)) / Σ_{j: ℓ_j = y} exp(K(h_t, m_j)),
v_t^y = Σ_{i: ℓ_i = y} α_i m_i,    w_t^y = Σ_{i: ℓ_i = y} α_i w_i    (2)

This method of reading is very similar to the soft addressing used in most memory models, but with the key difference that we average only over the cells with label y, not over all cells.

This label-specific vector read from memory is used to get a score for class y as

s_t^mem(y) = β w_t^y K(h_t, v_t^y)    (3)

where β is a strength parameter. The memory can thus be viewed as a kernel classifier with w_t^y denoting the weight of the kernel.
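
A minimal numpy sketch of the label-restricted read and the resulting memory score, assuming a cosine kernel and the softmax-normalized form of the fractional similarities in Equations 2 and 3; the cell layout (a list of dicts) and the `beta` strength parameter are illustrative assumptions, not a released implementation.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity, assumed here as the kernel K(., .)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def read_label(cells, y, h_t):
    """Soft read restricted to cells with label y (Eq. 2).
    cells: list of dicts {'label': int, 'vec': np.ndarray, 'w': float}."""
    mine = [c for c in cells if c['label'] == y]
    if not mine:
        return None, 0.0
    sims = np.array([cosine(h_t, c['vec']) for c in mine])
    alpha = np.exp(sims - sims.max())
    alpha /= alpha.sum()                 # fractional similarities alpha_i
    v_y = sum(a * c['vec'] for a, c in zip(alpha, mine))
    w_y = float(sum(a * c['w'] for a, c in zip(alpha, mine)))
    return v_y, w_y

def memory_score(cells, y, h_t, beta=1.0):
    """Kernel-classifier style score for label y (Eq. 3)."""
    v_y, w_y = read_label(cells, y, h_t)
    if v_y is None:
        return 0.0
    return beta * w_y * cosine(h_t, v_y)
```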

When memory is large, the time to compute memory scores could be a concern but it is easy to get fast approximate scores using a similarity index [Guo et al.2016, Kaiser et al.2017], and approximate nearest neighbor search has been used for large scale memories [Rae et al.2016, Chandar et al.2016].

Combining memory and PCN

The final score for a label y is computed by combining the PCN score s_t^pcn(y) and the memory score s_t^mem(y) with an interpolation parameter α_t(y). We make α_t(y) a dynamic variable, because we expect the relative importance of memory and PCN to evolve as more labeled data becomes available online. Following Adaboost [Schapire1999], one option is to choose α as a fixed function of the errors of the two classifier stages. However, we obtained better results by using an RNN to determine α as a function of the label and time as well. The RNN works on a label-dependent state and at each step takes three sets of inputs: first, the input embedding h_t; second, binary indicators of whether the memory and the PCN erred on the previous output; and third, the label probabilities predicted by the memory and the PCN. The RNN acts at each step on the state to produce label-dependent outputs, from which the relative weights of memory and PCN are obtained via a sigmoid layer:

z_t^y = RNN(z_{t-1}^y, [h_t, e_t^mem, e_t^pcn, P_t^mem(y), P_t^pcn(y)]),    α_t(y) = σ(u · z_t^y)    (4)

The final predicted distribution over the labels for an input x_t is

P(y | x_t) = softmax_y( α_t(y) s_t^mem(y) + (1 − α_t(y)) s_t^pcn(y) )    (5)
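
The sketch below shows one way the combiner could be realized in PyTorch, assuming a GRU cell whose per-label sigmoid output gives the interpolation weight of Equations 4 and 5; the hidden size, the exact feature layout fed to the RNN, and the score-level interpolation inside the softmax are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Combiner(nn.Module):
    """Label-dependent interpolation of memory and PCN scores (Eqs. 4-5)."""
    def __init__(self, feat_dim, hidden_dim=16):
        super().__init__()
        self.rnn = nn.GRUCell(feat_dim, hidden_dim)  # one hidden state per label
        self.out = nn.Linear(hidden_dim, 1)

    def forward(self, feats, state):
        # feats: (L, feat_dim) per-label features: the input embedding, error
        # indicators for memory/PCN at the previous step, and their predicted
        # probabilities for this label (layout is an assumption).
        state = self.rnn(feats, state)                       # (L, hidden_dim)
        alpha = torch.sigmoid(self.out(state)).squeeze(-1)   # (L,) in [0, 1]
        return alpha, state

def combined_probs(alpha, mem_scores, pcn_scores):
    """One plausible reading of Eq. 5: softmax of the interpolated scores."""
    return F.softmax(alpha * mem_scores + (1 - alpha) * pcn_scores, dim=-1)
```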
Training the combiner network

The combiner parameters are trained to minimize the loss over a labeled sequence of instances that is representative of the deployment setting. We take a pre-trained PCN and augment it with the memory and combiner modules. This new network is then trained in the online setting by providing it data sequentially.

An advantage of this architecture is that the memory and the PCN are loosely coupled, allowing the modules to be plugged onto existing models and mixed at varying levels depending on the amount of per-domain data. We show in the experimental section that even with α fixed to a single hyper-parameter (over all labels and time steps), LMNs are competitive with state-of-the-art model adaptation methods.

Figure 1: Comparing (a) LMN with (b) an existing memory network.

Adapting to true label

When the true label y_t of x_t is revealed we take two steps: update the state of the combiner RNN as discussed in Equation 4, and write to memory if needed. We describe our method for memory updates next.

The memory is partitioned across labels, and each memory cell is a tuple of label, content, and weight. The way in which we update the memory is consistent with its role of being a second classification stage that is updated online. We next discuss the three parts of our memory write strategy: when to write, what to write, and what cell to replace. The overall algorithm is depicted in Figure 2.

When to write

Like in online learners, we update memory only when the margin of separation between the scores of the true label and the closest incorrect label is small. Specifically, we update only if this margin is less than a threshold. For example, instances that are confidently classified by the PCN may never be fed to the memory. In contrast, existing MANNs write to memory for every instance and are prone to fill up the memory with cases that can be accurately handled by the PCN.

What to write

Consider the content m_i and weight w_i associated with each memory cell i at time t. We update the content of cells with label y_t using m_i ← m_i + α_i h_t, where α_i is the fractional similarity of h_t to the contents of cell i (Equation 2). The weight of the cell is incremented by the fractional similarity α_i after decaying the old weight. Further, if the prediction is incorrect we attempt to create a new cell in memory by replacing an existing cell.

What cell to replace

We replace the cell with the smallest weight among all cells with the same label as y_t. Since we update w_i based on its fractional similarity to instances that had non-zero loss, this method essentially replaces the cell that is least useful for defining the classification task. Unlike existing MANNs that replace the least recently used cell across all labels, we replace only among cells with label y_t. A purely LRU-based replacement could easily lead to forgetting of rare classes; our method of replacement enables a fairer classifier atop the memory.

  Input: embedding h_t, true label y_t, predicted label ŷ_t, margin γ, decay λ
  s_true ← combined score of y_t
  s_wrong ← highest combined score among labels ≠ y_t
  If s_true − s_wrong ≥ γ, no update. Return.
  Update cells with label y_t: m_i ← m_i + α_i h_t, w_i ← λ w_i + α_i (α_i from Equation 2)
  New cell c: ℓ_c ← y_t, m_c ← h_t, w_c ← 1
  r ← cell with the smallest weight among cells with label y_t
  if ŷ_t ≠ y_t and the cells for label y_t are full then
     Replace cell r with new cell c.
  end if
Figure 2: Memory updates in Labeled Memory Networks. In our experiments we used a decay λ of 0.99. The margin γ is a hyper-parameter.
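
The following Python sketch shows one way the write rule of Figure 2 could be implemented, reusing the hypothetical cell layout from the read sketch above; the cosine kernel, the per-label capacity `max_cells_per_label`, and the variable names are our assumptions rather than details fixed by the paper.

```python
import numpy as np

DECAY = 0.99  # weight decay lambda, as used in the experiments

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def maybe_write(cells, h_t, y_t, scores, margin, max_cells_per_label=5):
    """scores: dict label -> combined score; cells: list of dicts as before."""
    wrong = max(s for l, s in scores.items() if l != y_t)
    if scores[y_t] - wrong >= margin:
        return                                   # confident prediction: skip
    mine = [c for c in cells if c['label'] == y_t]
    if mine:
        sims = np.array([cosine(h_t, c['vec']) for c in mine])
        alpha = np.exp(sims - sims.max())
        alpha /= alpha.sum()                     # fractional similarities
        for a, c in zip(alpha, mine):            # merge h_t into existing cells
            c['vec'] = c['vec'] + a * h_t
            c['w'] = DECAY * c['w'] + a
    mispredicted = max(scores, key=scores.get) != y_t
    if mispredicted:                             # add a fresh prototype for y_t
        new = {'label': y_t, 'vec': h_t.copy(), 'w': 1.0}
        if len(mine) < max_cells_per_label:
            cells.append(new)
        else:                                    # replace least useful cell
            victim = min(mine, key=lambda c: c['w'])
            victim.update(new)
```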

Our memory updates are analogous to gradient updates in an online SVM learner. However, we operate in a limited-memory setting and merge and delete support vectors in a manner similar to budgeted PEGASOS [Wang, Crammer, and Vucetic2010]. If the memory vectors are considered SVM parameters, then the gradient of the objective w.r.t. these parameters is precisely the current vector h_t. In an online SVM learner, when the loss is non-zero the input element is added to the set of support vectors; similarly in LMN, when the margin loss is non-zero the current vector h_t is added to the memory. The contribution of this memory update to the memory scores in subsequent steps is then proportional to the kernel between h_t and future inputs. The memory thus acts as an online SVM learner in the dual form, where the stored vectors m_i are the support vectors and the weights w_i act as the associated dual variables.

Related Work

Memory in neural networks

The earliest example of the use of memory in neural networks is attention [Bahdanau, Cho, and Bengio2014]. Attention as memory is slow for long histories, which led to the development of several more flexible memory-based architectures [Weston2016]. Neural Turing Machines (NTMs) [Graves, Wayne, and Danihelka2014] were developed for end-to-end learning of algorithmic tasks. One task where NTMs were shown to work is learning N-gram distributions from token sequences. Since this is related to online sequence prediction, our first model was based on NTMs. However, on our real datasets we found NTMs not to be very effective, perhaps because of the controller's difficulty with adaptively generating keys and values for memory operations. Dynamic NTMs (DNTMs) [Gülçehre et al.2016] alleviate this via fixed trainable keys and have been shown to aid QA tasks, but they did not work well either, as we show in our experimental section. Like LMNs, DNTMs also use discrete memory addressing, but their keys are trained from the input, whereas in LMNs the discrete key is the class label, which requires no training. Another difference from LMNs is that the DNTM memory is tightly integrated with the neural network and requires joint training.

MANNs for classification

More recent MANNs designed for classification employ a looser coupling [Kaiser et al.2017, Santoro et al.2016] and, as in LMNs, store hidden vectors and discrete labels as memory contents. However, they differ in how the memory is addressed, updated, and used (shown graphically in Figure 1). First, LMNs treat memory as an online learner and update it only when the loss is non-zero, unlike all previous MANNs that update the memory at every step; this allows the memory to be used more effectively to store instances on which the PCN is weak. Second, in these MANNs memory reads are fed back into the main model to produce a prediction, whereas in LMNs memory reads are loosely combined with the PCN scores, allowing us to use memory for plug-and-play adaptation of a trained model. Finally, most MANNs use a global LRU replacement strategy, whereas we replace only within a label based on importance, which reduces the bias against rare classes present in LRU policies.

Model Adaptation by parameter re-learning

Another way to adapt is by retraining or tuning parameters. The methods of [Rei2015] and [Huang et al.2015] train only a subset of parameters that are local to each sequence. More recently, [Finn, Abbeel, and Levine2017] and [Ravi and Larochelle2017] propose meta-learners that "learn to learn" such adaptation via the loss gradient. In general, however, such model retraining techniques are resource-intensive. In our empirical evaluation we found these methods to be slower and less accurate than LMNs.

Online Learning

Online learning techniques such as [Shalev-Shwartz, Singer, and Srebro2007] for learning kernel coefficients are relevant if we view the memory vectors as support vectors and the memory scalars as the associated dual variables. Our setup is a little different in that we employ a mix of batch and online learning. Our proposed scheme of memory updates and merges was inspired by the gradient updates in PEGASOS and, in the case of exactly one cell per label, reduces to a specific form of budgeted PEGASOS [Wang, Crammer, and Vucetic2010].

Experiments

We next present an empirical comparison of LMNs with recent MANN-based meta-learners and state-of-the-art parameter-relearning based adaptive models. We experiment (code to be available at https://github.com/sshivs/LMN) on three different supervised learning tasks that require online model adaptation: sequence prediction on five datasets spanning applications like user trajectory prediction and next-click prediction, image classification under online addition of new image labels, and language modeling.

Online sequence prediction

In this task the data consists of a sequence of discrete tokens and the label to be predicted at time t is simply the next token x_{t+1}. The inputs are stateful and therefore the PCN has to be an RNN. We collected five such datasets from different real-life applications. In Table 1 we summarize the average length of each sequence, the number of tokens, and the number of sequences in the training and test sets.

  • FSQNYC and FSQTokyo are Location Based Social Network data collected by [Yang et al.2015] from FourSquare, recording user check-ins at various venues over a year.

  • Brightkite [Cho, Myers, and Leskovec2011] is a user check-in dataset made available as part of Stanford Network Analysis Project [Leskovec and Krevl2014].

  • Geolife [Zheng et al.2009] is trajectory data of people collected over multiple days, providing GPS coordinates tracked at roughly one-minute intervals. We discretize the locations at a resolution of 100 meters and limit them to the densest subset around the city (a minimal discretization sketch follows this list).

  • The Yoochoose dataset [Ben-Shimon et al.2015] contains click-event sessions for a major European e-tailer collected over six months.
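
As an illustration of the Geolife preprocessing mentioned above, one simple way to snap GPS coordinates onto an (approximately) 100-metre grid; the dataset description fixes only the resolution, so the conversion constants and function below are our assumptions.

```python
import math

def grid_token(lat, lon, cell_metres=100.0):
    """Map a (lat, lon) pair in degrees to a coarse grid-cell id.
    Assumes ~111.32 km per degree of latitude; longitude is scaled by cos(lat)."""
    metres_per_deg_lat = 111_320.0
    metres_per_deg_lon = metres_per_deg_lat * math.cos(math.radians(lat))
    row = int(lat * metres_per_deg_lat // cell_metres)
    col = int(lon * metres_per_deg_lon // cell_metres)
    return (row, col)   # each distinct (row, col) becomes one discrete token
```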

Name    No. of train sequences    No. of test sequences    Avg. seq. length    # Tokens or labels
Brightkite 1800 521 238 22811
FSQNYC 670 264 90 8325
FSQTokyo 1555 672 160 12589
Geolife 220 20 8000 31273
Yoochoose 450523 112279 13 39481
Table 1: Statistics of data used for on-line sequence learning
Experimental setups

In all experiments we used the Adam optimizer [Kingma and Ba2014]. The PCN is a GRU and its input is the embedding of the true token observed at the previous time step. The number of memory cells is equal to the number of labels.

Methods compared

We evaluate six different methods on these tasks.

  • As a baseline we train a larger LSTM [Hochreiter and Schmidhuber1997] which has roughly 5 times more RNN parameters compared to the PCN used in LMN. This lets us evaluate if a larger LSTM state could substitute for memory.

  • Next we choose two recent approaches from the family of meta-learners that retune model parameters: the method of [Rei2015], since it was specifically proposed for sequence prediction, and the more recent but generic MAML of [Finn, Abbeel, and Levine2017]. Both models use an LSTM of size 50 as the base learner. After each true label is encountered, these models compute the gradient of the loss with respect to the adapting parameters and apply the update before processing the next input.

  • As a representative of MANNs, we implemented a version of D-NTM [Gülçehre et al.2016] with two changes to adapt it to the online prediction task. First, we use the previous read address as the new write address, and second, we derive the new content from the read memory content and the current input. We tried several other tweaks, including the unchanged D-NTM, and obtained the best results with these changes.

  • We compare these with LMN. To illustrate the importance of adaptively reweighting the PCN and memory scores as per Equation 4, we also try a variant called LMN-fixed, in which α is fixed for all labels and time steps and is determined by hyper-parameter optimization.

Name    LSTM (baseline)    Rei (param-retune)    MAML (param-retune)    DNTM (memory)    LMN-fixed (memory)    LMN (memory)
Brightkite 10.7 (0.11) 10.01 (0.18) 4.27 (0.47) 9.63 (0.13) 3.88 (0.51) 3.57 (0.55)
FSQNYC 8.95 (0.03) 8.72 (0.07) 6.55 (0.16) 9.01 (0.05) 6.13 (0.25) 5.54 (0.27)
FSQTokyo 8.14 (0.08) 6.95 (0.13) 5.68 (0.23) 7.25 (0.11) 5.40 (0.26) 5.32 (0.28)
Geolife 1.13 (0.84) 1.08 (0.83) - 1.11 (0.82) 1.05 (0.83) 1.08 (0.83)
Yoochoose 5.01 (0.24) 5.05 (0.23) - 5.01 (0.23) 5.01 (0.24) 4.96 (0.25)
Table 2: Log perplexity and MRR (in parentheses) on online sequence prediction tasks

Results

In Table 2 we report the log perplexity and mean reciprocal rank (MRR) of all six methods on the five datasets. We make the following observations.

  1. LMN-based online model adaptation significantly boosts accuracy over a baseline LSTM with a five times larger state space. The datasets vary significantly in their baseline and LMN accuracies. The FSQNYC, FSQTokyo, and Brightkite datasets see MRR increase severalfold (e.g., from 0.11 to 0.55 on Brightkite). For Yoochoose the improvements are smallest because most sequences are very short (13 tokens on average). In Geolife the sequences are much longer, but the baseline MRR (0.84) is already high, indicating saturation.

  2. The improved accuracy of LMN over LMN-fixed illustrates the importance of online reweighting of PCN and memory. We observe improvements over the baseline on almost all datasets, even on Yoochoose where no other method improved. Even so, LMN-fixed is better than parameter retuning and DNTMs, establishing that even without training the combiner, LMNs can be used as plug-and-play model adapters.

  3. Meta-learners that retune parameters for adaptation are indeed effective compared to the baseline. The Rei method, which retunes only a designated 'sequence-vector' parameter, is less effective than MAML, which retunes all parameters. Yet our proposed LMN (or its simpler LMN-fixed variant) provides an even greater boost. For example, on Brightkite the MRR increases from 0.11 to 0.18 with Rei, to 0.47 with MAML, and to 0.51 with LMN-fixed and 0.55 with LMN. A major shortcoming of MAML is that training the meta-learner is very slow: we were not able to complete MAML training on our two largest datasets, Geolife and Yoochoose, within a reasonable time. In contrast, LMNs are significantly faster as they do minimal retraining. DNTM, another MANN-based approach, is not as effective as even LMN-fixed, and we believe this is mainly because of the differences in memory organization.

These experiments establish that a well-designed MANN is a practical option for accurate and efficient online adaptation for next-token prediction tasks.

Online Adaptation of Image Classifiers

We next demonstrate online adaptation of an image classifier to new labels observed during testing. Unlike existing work in few-shot learning [Vinyals et al.2016] where each test sequence has its own independent prediction space, we consider the more useful and challenging task where new labels co-exist with labels seen in the batch-trained PCN.

Dataset and setup

We use the popular Omniglot dataset [Lake, Salakhutdinov, and Tenenbaum2015]. The base dataset consists of 1623 hand-drawn characters from 50 different alphabets. [Vinyals et al.2016] extended the base dataset with various rotations of the images, which increases the number of categories to 4515. We create an online variant of this dataset as follows. We arbitrarily select 100 classes as unseen and 250 as seen classes. During training only seen classes are provided, but the test data has a uniform mix of all classes. Each test sequence is obtained by first randomly selecting 5 labels from the 350 and then choosing different input images from the selected labels up to a length of 10. Thus, in each sequence we expect to encounter roughly 1.4 (5 × 100/350) new labels on average. Accuracy is measured only over the second occurrence of each of the five labels, much like in one-shot learning experiments. The one major difference is that in our case the prediction space is all the seen labels plus the expected new labels, whereas in few-shot learning the prediction space is only the 5 labels chosen for that test sequence. The results are averaged over 100 sequences.
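
To make the evaluation protocol concrete, here is a hedged sketch of how such test sequences could be generated; the class counts follow the text, while the sampling helpers and data structures (`images_by_class`, with- vs. without-replacement image choice) are our assumptions.

```python
import random

def make_test_sequence(images_by_class, seen, unseen, n_labels=5, length=10):
    """Sample one online test sequence: n_labels labels drawn from the 350
    (250 seen + 100 unseen) classes, then `length` images from those labels."""
    labels = random.sample(list(seen) + list(unseen), n_labels)
    seq = []
    for _ in range(length):
        y = random.choice(labels)
        x = random.choice(images_by_class[y])   # a per-label draw without
        seq.append((x, y))                      # replacement is an alternative
    return seq
```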

Methods compared

We compare LMN with MAML [Finn, Abbeel, and Levine2017], the most recent meta-learner that can work in this setup. We do not compare with the two recent MANNs [Kaiser et al.2017, Santoro et al.2016] that report Omniglot results on few-shot learning, because their techniques do not easily extend to the case where new labels share the label space with a pre-trained classifier. For reference we also report the accuracy of the baseline classifier, which will certainly misclassify examples from the new classes. As the PCN we use a convolutional network with the same configuration as in [Kaiser et al.2017].

Model Overall New labels Seen labels
Baseline 45% 0 63%
MAML 56% 49% 59%
LMN 86% 71% 92%
Table 3: Accuracy on Online adaptation for Omniglot
Results

In Table 3 we report our results. The baseline, which does no adaptation, achieves an overall accuracy of 45%, with the expected 0% accuracy on the new labels and 63% on the seen labels. MAML boosts the overall accuracy to 56%, obtaining 49% on the new labels. However, compared to the baseline, its accuracy on seen labels drops. This is similar to catastrophic forgetting [French1999] and may be because the meta-learning updates deteriorate the weights of the shared representation layer. The LMN architecture is better able to absorb new labels and changing priors of existing labels: LMN achieves an overall accuracy of 86%, with 71% on new labels. The accuracy increases for the seen labels as well, because LMN can use the memory to store prototypes of examples with non-zero loss even from seen labels. Thereafter, the combiner RNN can pay more attention to the labels seen during the adaptation phase.

Plain few-shot learning

To enable comparison with the few-shot results of many recently published works, we also report results of LMN on this task. In few-shot learning, the PCN only generates the input embedding and does not assign label scores, since each test sequence has its own label space. The memory uses the embedding to assign label scores via Equation 3, based on which the prediction is made without involving any combiner. These experiments compare LMN's memory organization and update strategies with state-of-the-art MANNs that have reported results on few-shot learning [Kaiser et al.2017].

Our setup is the same as in recent published work [Vinyals et al.2016] on few-shot learning, except that the total label space has roughly 4K possible labels. This is more interesting and realistic than 5-way classification, where many recent methods, including ours, report more than 98% accuracy with just one shot.

We present results with varying memory size, running experiments with T cells per label (T = 2, 3, 5). We observe that LMN accuracy is much higher than that of existing MANNs, particularly when memory is limited. For example, at 2 cells per label (roughly 8K memory slots) we obtain more than 4% higher accuracy in 1-shot, 2-shot, and 3-shot learning. This demonstrates the superior use of memory achieved by our labeled memory organization.

For reference we also compare with the parameter-retuning approach MAML [Finn, Abbeel, and Levine2017]. Even though this work reports state-of-the-art accuracy on 5-way few-shot learning, it is not as effective for 4K-way few-shot learning. A softmax layer that has to arbitrate among 4K new classes perhaps needs many more gradient updates than a meta-learner can provide.

Name 1 shot 2 shot 3 shot
Kaiser (2-cell/label) 48.2% 58.0% 60.3%
Our model (2-cell/label) 52.6% 62.5% 64.1%
Kaiser (3-cell/label) 49.7% 60.1% 63.8%
Our model (3-cell/label) 52.8% 63.0% 67.0%
Kaiser (5-cell/label) 52.7% 63.0% 66.3%
Our model (5-cell/label) 54.3% 63.2% 66.9%
MAML 44.2% 46.5% 47.3%
Table 4: Test accuracies for 4k way few shot learning

Language Modeling

Language modeling is the task of learning the probability distribution of words over texts. When framed as modeling the probability of the next word conditioned on the previous text, it is just another online sequence prediction task. The natural dependence on history in this task provides another use case for memory, and memory-based models have been shown to improve over standard RNN language models [Sukhbaatar et al.2015, Merity et al.2016, Grave, Joulin, and Usunier2016]. In the same spirit, we apply LMNs to this task with an RNN as the PCN.

We compare our model directly with the recently published neural cache model of [Grave, Joulin, and Usunier2016] and the pointer LSTM of [Merity et al.2016]. These showcase variant uses of memory to improve the prediction of words that repeat in long text. The baseline model (and PCN) is an LSTM with the same parameters as in [Grave, Joulin, and Usunier2016]. We compare on the common language modeling datasets Wikitext2 and Text8 with memory sizes of 100 and 2000, as used in the previously published work. We used standard SGD as the optimizer.

In Table 5 we report the perplexities obtained with LMN along with the results reported in published work for the other three approaches. As the table shows, we achieve state-of-the-art results on Text8 and Wikitext2. In LMN, storing only the cases where the PCN is weak provides an automatic modulation that is superior to a memory cache. This shows up especially when memory is constrained: the focus on mis-predicted outputs allows LMN to achieve better recall, more efficient memory utilization, and longer effective context than the other models.

Name    Wikitext2 (memory 100)    Text8 (memory 2000)
LSTM 99.7 120.9
Pointer LSTM 80.8 -
Neural cache 81.6 99.9
LMN 77.6 91.1
Table 5: Test perplexity for language modeling on Wikitext2 and Text8 with memory sizes 100 and 2000 respectively.

Conclusion

We extended standard memory models with a label-addressable memory module and an adaptive weighting mechanism for online model adaptation. LMNs mix limited data with pre-trained models by combining ideas from boosting and online kernel learning, while tapping deep networks to learn representations and RNNs to model the evolving roles of memory and pre-trained models. LMNs have some similarities to recent MANNs but differ significantly. First, we use a label-addressable memory instead of content-based addressing. Second, we use the memory to store only content on which the primary network is weak. Third, our model loosely couples memory and network, and hence can augment pre-trained models at very low cost. Fourth, we use an adaptive reweighting mechanism to modulate the contributions of memory and PCN. LMNs are demonstrated to be successful on a variety of challenging classification tasks that require fast adaptation to the input and handling of non-local dependencies. An interesting extension of LMNs is organizing the memory not just by discrete labels but by learned multivariate embeddings of labels, paving the way for greater sharing among labels.

Acknowledgements

We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan X Pascal GPU used for this research.

References