Meta-Learning via Feature-Label Memory Network

10/19/2017 ∙ by Dawit Mureja, et al.

Deep learning typically requires training a very capable architecture using large datasets. However, many important learning problems demand the ability to draw valid inferences from small datasets, and such problems pose a particular challenge for deep learning. In this regard, various lines of research on "meta-learning" are being actively pursued. Recent work has suggested a Memory Augmented Neural Network (MANN) for meta-learning. MANN is an implementation of a Neural Turing Machine (NTM) with the ability to rapidly assimilate new data in its memory and use this data to make accurate predictions. In models such as MANN, the input data samples and their appropriate labels from the previous time step are bound together in the same memory locations. This often leads to memory interference when performing a task, as these models have to retrieve a feature of an input from a certain memory location and read only the label information bound to that location. In this paper, we address this issue by presenting a more robust MANN. We revisit the idea of meta-learning and propose a new memory augmented neural network that explicitly splits the external memory into feature and label memories. The feature memory is used to store the features of input data samples, and the label memory stores their labels. Hence, when predicting the label of a given input, our model uses its feature memory unit as a reference to extract the stored feature of the input and, based on that feature, retrieves the label information of the input from the label memory unit. In order for the network to function in this framework, we design a new memory-writing module that encodes label information into the label memory in accordance with the meta-learning task structure. We demonstrate that our model outperforms MANN by a large margin in supervised one-shot classification tasks using the Omniglot and MNIST datasets.

1 Introduction

Deep learning is heavily dependent on big data. Traditional gradient-based neural networks require extensive and iterative training using large datasets. In these models, learning occurs through a continuous update of weight parameters that optimizes the loss function. However, when there is only a small amount of data to learn from, deep learning is prone to poor performance: traditional networks cannot acquire enough knowledge about the specific task via weight updates, and hence they fail to make accurate predictions when tested.

(a) Task setup
(b) Encoding and Retrieving
Figure 1: Meta-learning task structure. (a) Omniglot images x_t are presented along with labels y_{t-1} in a temporally offset manner: at time step t, the network sees an input image together with the label of the input from the previous time step. Labels are also shuffled from episode to episode. This prevents the model from learning sample-class bindings via weight updates; instead, it learns to regulate input and output information using its two memories. (b) How the model works. When the network sees an input image for the first time at a certain time step, it stores a particular feature of the input in the feature memory. When the appropriate label is presented at the next time step, the network stores the label information of the input in the label memory. Sample-class bindings are then formed between the two memories. When the network is given the same class of image at a later time step, it retrieves the input feature from the feature memory and uses the retrieved information to read the corresponding label memory for prediction.

Previous works have approached the task of learning from few samples using different methods, such as probabilistic models based on Bayesian learning [Fei-Fei, Fergus, and Perona2006], generative models using probability density functions [Lake et al.2011, Rezende et al.2016], Siamese neural networks [Koch2015], and meta-learning based memory augmented models [Santoro et al.2016, Vinyals et al.2016].

In this work, we revisited the problem of meta-learning using memory augmented neural networks. Meta-learning is a two-tiered learning framework in which an agent learns not only about a specific task, for instance image classification, but also about how the task structure varies across target domains [Christophe, Ricardo, and Pavel2004, Santoro et al.2016]. Neural architectures with an external memory, such as Neural Turing Machines (NTMs) [Graves, Wayne, and Danihelka2014] and memory networks [Weston, Chopra, and Bordes2014], have demonstrated the ability to perform meta-learning.

Recent memory augmented neural networks for meta-learning, such as MANN [Santoro et al.2016], use a plain memory matrix as an external memory. In these models, the input data samples and their appropriate labels from the previous time step are bound together in the same memory locations. This often leads to memory interference when performing a task, as the models have to retrieve a feature of an input from a certain memory location and read only the label information bound to that location.

Our primary contribution in this work is designing a different version of NTM [Graves, Wayne, and Danihelka2014] by splitting the external memory into feature and label memories to avoid any catastrophic interference. The feature memory is used to store input data features and the label memory is used to encode the label information of the inputs. Therefore, during testing, ideal performance in our model requires using the feature memory as a reference to accurately retrieve the stored feature of an input image and effectively reading the corresponding label information from the label memory. In order to accomplish this, we designed a new memory writing module based on the meta-learning task structure that monitors the way in which information is written into the label memory.

2 Related Work

Our work is based on recent work by Santoro et al. [Santoro et al.2016]. They approached the problem of one-shot learning with the notion of meta-learning and suggested a Memory Augmented Neural Network (MANN). MANN is an implementation of the NTM [Graves, Wayne, and Danihelka2014] with the ability to rapidly assimilate new data and use this data to make accurate predictions after only a few samples.

In previous implementations of the NTM, memory was addressed both by content and by location. In their work, however, they presented a new memory access module called Least Recently Used Access (LRUA) [Santoro et al.2016]. It is a purely content-based memory writer that writes memories either to the least recently used location or to the most recently used location of the memory. Under this module, new information is written into rarely used locations (preserving recently encoded information) or into the last used location (to update the memory with newer, and possibly relevant, information).

3 Task Methodology

In this work, we used a task structure similar to that of recent works [Santoro et al.2016, Vinyals et al.2016]. As we implemented supervised learning, the model is tasked to infer information from labelled training data. This normally involves presenting the label y_t along with the input x_t at time step t. However, in our work, the training data was presented in the following manner: D = {(x_t, y_{t-1})}, where D is the dataset, x_t is the input at time step t, and y_{t-1} is the class label from the previous time step t-1. Therefore, the model sees the following input sequence: (x_1, null), (x_2, y_1), ..., (x_T, y_{T-1}) (Figure 1(a)).

Moreover, the label used for a particular class of input images in one episode is not necessarily the same as the label used for the same class in another episode. Labels are randomly shuffled from episode to episode in order to prevent the model from slowly learning sample-class bindings in its weights. Instead, the model learns to store input information into the feature memory and to store the corresponding output information, presented at the next time step, into the label memory. Sample-class bindings between the input features in the feature memory and the class labels in the label memory are then formed for later use (Figure 1(b)).
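To make the episode construction concrete, the following is a minimal sketch of how an episode with shuffled labels and a temporally offset label stream could be generated. It assumes a generic `dataset` mapping each class to a list of image arrays; the function name and parameters are illustrative, not taken from the paper's actual code.

```python
import random

def make_episode(dataset, num_classes=5, samples_per_class=10, num_labels=5):
    """Sample one episode: images paired with the *previous* step's label.

    `dataset` is assumed to map a class id to a list of image arrays.
    Labels are re-drawn every episode so the model cannot bind a class
    to a fixed label through slow weight updates.
    """
    classes = random.sample(list(dataset.keys()), num_classes)
    # Assign a fresh, episode-specific label to each sampled class.
    episode_labels = random.sample(range(num_labels), num_classes)
    label_of = dict(zip(classes, episode_labels))

    # Draw samples and shuffle their presentation order within the episode.
    samples = [(img, label_of[c])
               for c in classes
               for img in random.sample(dataset[c], samples_per_class)]
    random.shuffle(samples)

    images = [img for img, _ in samples]
    labels = [lbl for _, lbl in samples]
    # Temporal offset: at step t the network sees (x_t, y_{t-1});
    # None stands in for the null label given with the first input.
    offset_labels = [None] + labels[:-1]
    targets = labels
    return images, offset_labels, targets
```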

4 Memory Augmented Model

The Neural Turing Machine (NTM) [Graves, Wayne, and Danihelka2014] is a memory augmented neural network that has two main components: a controller and an external memory. It can be seen as a differentiable version of a Turing machine. The controller is a neural network that provides an internal representation of the input, which read and write heads use to interact with the external memory. The controller can be either a feed-forward or a recurrent neural network.

In this work, we designed a memory augmented neural network, a different version of the NTM, with its memory split into two partitions: a feature memory (M^f) and a label memory (M^l). The feature memory is used as a reference memory to retrieve the stored representation of an input. The label memory is used to read the output information of the input based on the information retrieved from the feature memory. In our model, we used a Long Short-Term Memory (LSTM) [Hochreiter and Schmidhuber1997] as the controller due to its better performance compared to other controller models. Figure 2 shows a high-level diagram of our model.

Figure 2: Feature-Label Memory Network (FLMN). It is a memory augmented neural network with an LSTM controller and two memories (the feature and label memories). Input features are encoded into the feature memory using the feature memory write head. Labels of the inputs are written into the label memory using the label memory write head. The two write heads are linked recursively in accordance with the task structure. The label read head is used to read the label information of a given input from the label memory.

The Feature-Label Memory Network (FLMN) has two memories and, hence, two write heads. The feature memory write head writes into the feature memory (M^f). The label memory write head writes into the label memory (M^l). Even though information is encoded in both memories, output information is read only from the label memory using the label read head.

Here is how our model works. Given some input x_t at time step t, the controller produces three interface vectors: a key vector k_t and two add vectors, a^f_t and a^l_t. The key vector k_t is used to retrieve a particular memory from a row of the feature memory, i.e. M^f_t(i). The add vectors a^f_t and a^l_t are used to modify the contents of the feature memory (M^f) and the label memory (M^l), respectively.

4.1 Reading from the Label Memory

Before the output information of an input image x_t is read from the label memory, the corresponding feature of x_t is retrieved from the feature memory using the key k_t. When retrieving memory, row i of the feature memory, M^f_t(i), is addressed using the cosine similarity measure,

K(k_t, M^f_t(i)) = \frac{k_t \cdot M^f_t(i)}{\lVert k_t \rVert \, \lVert M^f_t(i) \rVert}    (1)

This measure, K(k_t, M^f_t(i)), is then used to produce a read-weight vector w^r_t whose elements are computed according to the following softmax:

w^r_t(i) = \frac{\exp\big(K(k_t, M^f_t(i))\big)}{\sum_j \exp\big(K(k_t, M^f_t(j))\big)}    (2)

The read weights are then used to read from the label memory (M^l_t). The read memory, r_t, is computed as follows,

r_t = \sum_i w^r_t(i) \, M^l_t(i)    (3)
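As a concrete illustration of the read path, the following NumPy sketch implements equations (1)-(3) for a single read head: a cosine-similarity match of the key against the feature memory, a softmax over the similarities, and a weighted read from the label memory. The array shapes and function name are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def read_label_memory(key, feature_memory, label_memory, eps=1e-8):
    """Read from the label memory using a key matched against the feature memory.

    key            : (M,)  key vector k_t produced by the controller
    feature_memory : (N, M) feature memory M^f_t
    label_memory   : (N, L) label memory  M^l_t
    Returns the read weights w^r_t and the read memory r_t.
    """
    # Eq. (1): cosine similarity between the key and every feature-memory row.
    sims = feature_memory @ key / (
        np.linalg.norm(feature_memory, axis=1) * np.linalg.norm(key) + eps)
    # Eq. (2): softmax over the similarities gives the read weights.
    exp = np.exp(sims - sims.max())
    read_weights = exp / exp.sum()
    # Eq. (3): the read weights address the *label* memory, not the feature memory.
    read_memory = read_weights @ label_memory
    return read_weights, read_memory
```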

4.2 Writing into the Feature Memory

In order to write into the feature memory, we implemented the LRUA module [Santoro et al.2016] with slight modifications. According to this module, new information is written either into rarely used locations or into the last used location. The distinction between these two options is accomplished by an interpolation using a usage weight vector w^u_t.

The usage weight vector w^u_t at a given time step is computed by decaying the previous usage weights and adding the current feature memory write weights and read weights as follows,

w^u_t = \gamma \, w^u_{t-1} + w^{wf}_t + w^r_t    (4)

where \gamma is a decay parameter.

In order to access the least-used locations of the feature memory, a least-used weight vector w^{lu}_t is defined from the usage weight vector w^u_t,

w^{lu}_t(i) = \begin{cases} 0 & \text{if } w^u_t(i) > m(w^u_t, n) \\ 1 & \text{if } w^u_t(i) \le m(w^u_t, n) \end{cases}    (5)

where m(w^u_t, n) denotes the n-th smallest element of w^u_t and n is set to the number of reads from memory.

Write weights w^{wf}_t for the feature memory are then obtained by using a learnable sigmoid gate parameter to compute a convex combination of the previous read weights and the previous least-used weights,

w^{wf}_t = \sigma(\alpha) \, w^r_{t-1} + \big(1 - \sigma(\alpha)\big) \, w^{lu}_{t-1}    (6)

where \sigma(\cdot) is the sigmoid function and \alpha is a scalar gate parameter used to interpolate between the weights.

Therefore, new content is written either to the previously read location (if \sigma(\alpha) is 1) or to the least-used location (if \sigma(\alpha) is 0). Before writing into the feature memory, the least-used location of the memory is cleared. This can be done via element-wise multiplication using the least-used weights from the previous time step:

M^f_{t-1}(i) \leftarrow M^f_{t-1}(i) \, \big(1 - w^{lu}_{t-1}(i)\big)    (7)

Writing into memory then occurs in accordance with the computed write weights, using the feature add vector a^f_t, as follows,

M^f_t(i) = M^f_{t-1}(i) + w^{wf}_t(i) \, a^f_t    (8)
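The following sketch puts equations (4)-(8) together for one write step into the feature memory. It is a simplified NumPy rendering of the LRUA-style update described above, with the gate parameter passed in as a raw scalar; the names, shapes, and decay value are illustrative assumptions rather than the authors' code.

```python
import numpy as np

def write_feature_memory(feature_memory, feature_add, prev_read_w, prev_lu_w,
                         prev_usage_w, read_w, alpha, gamma=0.95, n_reads=1):
    """One LRUA-style write into the feature memory (equations (4)-(8)).

    feature_memory : (N, M) feature memory M^f
    feature_add    : (M,)   feature add vector a^f_t
    prev_read_w, prev_lu_w, prev_usage_w, read_w : (N,) weight vectors
    alpha          : scalar gate parameter (before the sigmoid)
    """
    gate = 1.0 / (1.0 + np.exp(-alpha))                 # sigma(alpha)
    # Eq. (6): interpolate between previous read and previous least-used weights.
    write_w = gate * prev_read_w + (1.0 - gate) * prev_lu_w
    # Eq. (7): clear the previously least-used location before writing.
    feature_memory = feature_memory * (1.0 - prev_lu_w)[:, None]
    # Eq. (8): additive write scaled by the write weights.
    feature_memory = feature_memory + write_w[:, None] * feature_add[None, :]
    # Eq. (4): update the usage weights with decay.
    usage_w = gamma * prev_usage_w + write_w + read_w
    # Eq. (5): mark the n least-used locations (n = number of memory reads).
    nth_smallest = np.sort(usage_w)[n_reads - 1]
    lu_w = (usage_w <= nth_smallest).astype(float)
    return feature_memory, write_w, usage_w, lu_w
```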

4.3 Writing into the Label Memory

According to (3), the read memory r_t is retrieved from the label memory using the read weights w^r_t, whose elements are computed using (2) and therefore involve the feature memory M^f_t. Hence, the label memory should be written in a manner similar to the feature memory, so that when an input image x_t is provided to the network at time step t, the network retrieves the stored feature of the input from M^f_t and, based on that feature, extracts the label of the input image from M^l_t.

In order to accomplish the above scenario, we designed a new memory writing module for the label memory. The new module is based on the task setup in which the model is trained. As mentioned earlier, during training the model sees the following input sequence: (x_1, null), (x_2, y_1), ..., (x_T, y_{T-1}). The label y_t presented at time step t+1 is the appropriate label for the input x_t, which was presented along with the label y_{t-1} at time step t. Based on this observation, we designed a recursive memory writing module.

According to this module, the label memory write-weight vector w^{wl}_t at time step t is computed from the previous feature memory write-weight vector w^{wf}_{t-1} in a recursive manner as follows,

w^{wl}_t = w^{wf}_{t-1}    (9)

The label memory (M^l) is then written according to the write weights w^{wl}_t using the label add vector a^l_t,

M^l_t(i) = M^l_{t-1}(i) + w^{wl}_t(i) \, a^l_t    (10)

This memory is then read as shown in (3) to give a read memory, r_t, which is used by the controller as an input to a softmax classifier and as an additional input for the next controller state.

Based on this module, the label presented at time step t is written into the label memory in the same manner as the corresponding input from the previous time step t-1 was written into the feature memory. This enables the model to accurately retrieve input information from the feature memory and to use this feature to effectively read the corresponding output information from the label memory without any interference.
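A corresponding sketch of the recursive label write of equations (9) and (10): the label write weights simply reuse the feature write weights from the previous time step, so the label is deposited in the same row where the matching input feature was stored. As with the earlier sketches, the names and shapes are assumptions for illustration.

```python
import numpy as np

def write_label_memory(label_memory, label_add, prev_feature_write_w):
    """Write the current label using last step's feature write weights.

    label_memory         : (N, L) label memory M^l
    label_add            : (L,)   label add vector a^l_t (encodes the offset label)
    prev_feature_write_w : (N,)   feature write weights w^{wf}_{t-1}
    """
    # Eq. (9): the label write weights are the previous feature write weights,
    # so the label lands in the row(s) holding the corresponding input feature.
    label_write_w = prev_feature_write_w
    # Eq. (10): additive write into the label memory.
    return label_memory + label_write_w[:, None] * label_add[None, :]
```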

5 Experimental Results

We tested our model on one-shot image classification tasks using the Omniglot and miniMNIST datasets. The Omniglot dataset consists of 1623 character classes from 50 different alphabets, with 20 samples per class (character). The dataset is sometimes called the "MNIST transpose" because it contains a large number of classes with relatively few data samples per class. This makes the dataset ideal for one-shot learning.

5.1 Experiment Setup

In this work, we implemented both our model and MANN [Santoro et al.2016] and compared their performance in supervised one-shot classification tasks. However, the experimental settings we used for implementing MANN are slightly different from the implementation of MANN in the original paper [Santoro et al.2016].

(a) Experiment I. Training accuracy for MANN
(b) Experiment I. Training accuracy for FLMN
Figure 3: Omniglot classification. No data augmentation was performed. In (a) and (b), each episode contains 5 classes and 10 samples per class. As expected, the 1st-instance accuracy is quite low in both models, because the models have to make a random guess at the first presentation of a class. However, as we can see from (b), the FLMN 1st-instance accuracy is better than a blind guess, especially after 20,000 episodes, which indicates that the model is making an educated guess for new classes based on the classes it has already seen. For the 2nd and later instances, both models use their memory to achieve better accuracy. The 2nd-instance accuracy of FLMN reached 80% within the first 20,000 episodes, while the 2nd-instance accuracy of MANN reached only 40%.
(a) Experiment II. Training accuracy for MANN
(b) Experiment II. Training accuracy for FLMN
Figure 4: Omniglot classification. Data augmentation was performed by rotating and translating random character images in an episode. Each episode contains 5 classes and 10 samples per class. As we can see from (a) and (b), our model outperformed MANN by displaying better training accuracy for each instance.

In the original paper, four reads from memory were used. Data augmentation was performed by randomly translating and rotating character images. New classes were also created through 90°, 180°, and 270° rotations of existing data. A minibatch size of 16 was used.

In our case, a single read from memory was used. In order to make a fair comparison, we tried to balance the memory capacity of the two models: we used a single memory matrix for MANN, whose rows are the memory locations and whose columns give the size of each location, and for our model we split the same total memory into two, using a matrix of half the size for each of the feature and label memories. Using these settings, we performed three types of experiments.

5.2 Experiment: Type 1

In the first experiment, the original Omniglot dataset was used without any data modification. Out of the 1623 available classes, 1209 classes were used for training and the remaining 414 classes were used for testing the models. Note that these two sets are disjoint, so after training, both models were tested on never-seen Omniglot classes. For computational simplicity, the images were down-scaled. One-hot vector representations were used for class labels, and training was done for 100,000 episodes. Several experiments were performed for different numbers of classes (and different numbers of samples per class) in an episode. Figure 3 shows the training accuracy of the models for 5 classes and 10 samples per class in an episode.

As we can see from Figure 3, our model outperformed MANN in making accurate predictions. The 2nd-instance accuracy of our model reached nearly 80% within the first 20,000 episodes of training, while the 2nd-instance accuracy of MANN reached only about 40%.

5.3 Experiment: Type 2

In our second experiment, we performed data augmentation without creating new classes. The dataset was augmented by rotating and translating random character images in an episode. The rotation angle was drawn from a uniform distribution for each selected image in an episode, accompanied by a translation in the x and y dimensions with values uniformly sampled between -10 and 10 pixels. The images were then down-scaled as before.
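For illustration, the augmentation described above could be implemented along the following lines with SciPy; the rotation range used here is an assumption, since the exact range from the experiments is not stated, and the function name is hypothetical.

```python
import numpy as np
from scipy import ndimage

def augment_image(image, rng, max_shift=10):
    """Randomly rotate and translate a character image (illustrative sketch).

    image : 2-D array holding a single character image
    rng   : a NumPy Generator, e.g. np.random.default_rng()
    """
    # Rotation angle in degrees; the [0, 360) range is an assumption.
    angle = rng.uniform(0.0, 360.0)
    rotated = ndimage.rotate(image, angle, reshape=False, mode="nearest")
    # Translation in x and y, uniformly sampled between -10 and 10 pixels.
    dx, dy = rng.uniform(-max_shift, max_shift, size=2)
    return ndimage.shift(rotated, (dy, dx), mode="nearest")
```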

As in the previous experiment, 1209 classes (plus augmentations) were used for training and 414 classes (plus augmentations) were used for testing. Figure 4 shows the training accuracy of MANN and our model over 100,000 episodes.

Not only did our model make more accurate predictions, it also learned faster than MANN. This can be seen by plotting the training loss for the two experiments (Figure 5).

(a) Experiment I. Loss graph for MANN and FLMN
(b) Experiment II. Loss graph for MANN and FLMN
Figure 5: Loss graphs. For the same learning rate, the loss of FLMN falls much more sharply than the loss of MANN in both (a) and (b). This indicates that FLMN learned the task of one-shot classification faster than MANN and explains why it demonstrated higher training accuracy within the first few episodes.

In both types of experiments, training was stopped at the 100,000-episode mark. Without any further training, the models were tested on never-seen Omniglot classes from the testing set. The testing results are summarized in Table 1. We include the test results of MANN from [Santoro et al.2016] for reference.

As we can see from the table, our model demonstrated higher classification accuracy than MANN in both experiments. FLMN reached an accuracy of 85.6% (Experiment I) and 86.5% (Experiment II) on just the second presentation of an input sample from a class within an episode, reaching 94.1% and 94.4% accuracy by the 10th instance, respectively. By contrast, MANN achieved an accuracy of 66.7% (Experiment I) and 65.5% (Experiment II) on the 2nd instance, reaching 78.1% and 77.2% accuracy by the 10th instance, respectively.

Model                        1st    2nd    3rd    4th    5th    10th
MANN [Santoro et al.2016]    36.4   82.8   91.0   92.6   94.9   98.1
MANN (Experiment I)          21.6   66.7   74.0   76.0   76.3   78.1
FLMN (Experiment I)          31.1   85.6   88.7   89.5   91.0   94.1
MANN (Experiment II)         22.2   65.5   72.0   74.5   76.4   77.2
FLMN (Experiment II)         33.9   86.5   89.1   89.7   91.1   94.4
Table 1: Test-set classification accuracies (% correct per instance) of MANN and FLMN for Experiment I and Experiment II

5.4 Zero-shot learning

In this experiment, the models were tasked with MNIST classification after being trained on the Omniglot dataset. We used 1209 Omniglot classes for training. For testing, we prepared a miniMNIST dataset: it contains only 20 image samples per class, randomly selected from the original MNIST dataset, and the images were down-scaled. After 100,000 episodes of training, the models were tested on the never-seen MNIST classes. The testing results are summarized in Table 2.
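As an aside, a miniMNIST-style subset of this kind can be drawn with a few lines of NumPy; the sketch below assumes MNIST images and labels are already loaded as arrays and is not the authors' exact preprocessing.

```python
import numpy as np

def make_mini_mnist(images, labels, samples_per_class=20, seed=0):
    """Build a miniMNIST-style subset: 20 randomly chosen samples per digit.

    images : (N, 28, 28) array of MNIST images
    labels : (N,) array of digit labels
    """
    rng = np.random.default_rng(seed)
    keep = []
    for digit in range(10):
        idx = np.flatnonzero(labels == digit)
        keep.extend(rng.choice(idx, size=samples_per_class, replace=False))
    keep = np.array(keep)
    return images[keep], labels[keep]
```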

Model    Instance accuracies (% correct)
MANN     14.6    37.3    52.0
FLMN     28.5    67.6    80.5
Table 2: Test-set classification accuracies of MANN and FLMN for zero-shot learning

As we can see from Table 2, FLMN achieved up to 80.5% accuracy in classifying never-seen-before images from the miniMNIST dataset after being trained only on the Omniglot dataset.

6 Conclusion

In this paper, we implemented a meta-learning framework and proposed the Feature-Label Memory Network (FLMN). The novelty of our model is that it stores input data samples and their matching labels in separate memories, preventing memory interference. We also introduced a new memory writing method tied to the task structure of meta-learning. We have shown that our model outperforms MANN in supervised one-shot classification tasks using the Omniglot and miniMNIST datasets. Future work includes testing our model on more complex datasets and evaluating its performance on other tasks.

References

  • [Christophe, Ricardo, and Pavel2004] Christophe, G.-C.; Ricardo, V.; and Pavel, B. 2004. Introduction to the special issue on meta-learning. Machine Learning 54(3):187–193.
  • [Fei-Fei, Fergus, and Perona2006] Fei-Fei, L.; Fergus, R.; and Perona, P. 2006. One-shot learning of object categories. IEEE transactions on pattern analysis and machine intelligence 28(4):594–611.
  • [Graves, Wayne, and Danihelka2014] Graves, A.; Wayne, G.; and Danihelka, I. 2014. Neural turing machines. arXiv preprint arXiv:1410.5401.
  • [Hochreiter and Schmidhuber1997] Hochreiter, S., and Schmidhuber, J. 1997. Long short-term memory. Neural Computation 9(8):1735–1780.
  • [Koch2015] Koch, G. 2015. Siamese neural networks for one-shot image recognition. PhD thesis, University of Toronto.
  • [Lake et al.2011] Lake, B. M.; Salakhutdinov, R.; Gross, J.; and Tenenbaum, J. B. 2011. One shot learning of simple visual concepts. In Proceedings of the Annual Conference of the Cognitive Science Society 72:2.
  • [Rezende et al.2016] Rezende, D. J.; Mohamed, S.; Danihelka, I.; Gregor, K.; and Wierstra, D. 2016. One-shot generalization in deep generative models. In Proceedings of the International Conference on Machine Learning, JMLR:W&CP 48.
  • [Salakhutdinov, Tenenbaum, and Torralba2012] Salakhutdinov, R.; Tenenbaum, J.; and Torralba, A. 2012. One-shot learning with a hierarchical nonparametric bayesian model. In Proceedings of ICML Workshop on Unsupervised and Transfer Learning, PMLR 27:195–206.
  • [Santoro et al.2016] Santoro, A.; Bartunov, S.; Botvinick, M.; Wierstra, D.; and Lillicrap, T. 2016. Meta-learning with memory-augmented neural networks. In Proceedings of The 33rd International Conference on Machine Learning, 1842–1850.
  • [Vinyals et al.2016] Vinyals, O.; Blundell, C.; Lillicrap, T.; Kavukcuoglu, K.; and Wierstra, D. 2016. Matching networks for one shot learning. In Advances in Neural Information Processing Systems 3630–3638.
  • [Weston, Chopra, and Bordes2014] Weston, J.; Chopra, S.; and Bordes, A. 2014. Memory networks. arXiv preprint arXiv:1410.3916.
  • [Woodward and Finn2017] Woodward, M., and Finn, C. 2017. Active one-shot learning. arXiv preprint arXiv:1702.06559.