Deep learning based computer vision systems have recently achieved great success and shown outstanding performance in many applications, such as object classification [14, 27, 13], semantic segmentation [16, 5], and face recognition [32, 31].
However, deep learning based computer vision still struggles when labeled data is scarce [29, 9]. By comparison, the human vision system can efficiently learn a new object from only a few images and still recognize it well. Current computer vision systems clearly lag far behind human vision in this respect.
Recently, several Meta learning approaches [4, 3, 25, 7, 17, 24, 23, 19] have improved the ability of computer vision systems to learn from few data. Unlike ordinary deep learning approaches, which train a network on a distribution of data, Meta learning approaches train a Meta learner on a distribution of tasks, so that the Meta learner generalizes well to previously unseen tasks sampled from the same distribution.
Even with this improvement, previous Meta learners have not closed much of the gap with human vision, mainly for two reasons: 1) given a specific task with few data and no attention mechanism [30, 33, 12, 20, 28, 18, 11], these Meta learners cannot attend to the most distinguishable features of the input images when solving the task; 2) these Meta learners cannot make good use of previously learned knowledge to express the input images as accurate high-level representations, so they have to update themselves quickly based on the few original high-dimensional RGB input images, as shown in Figure 1(a).
In this paper, we argue that the attention mechanism is important for the Meta learner, that the Meta learner should take full advantage of previously learned knowledge to express the input data accurately as low-dimensional high-level representations, and that the Meta learner should update itself based on these high-level representations instead of the original input data, as shown in Figure 1(b).
Consider an example few-shot task with six images belonging to two image classes: four labeled images are the training data, and two unlabeled images are the testing data. We are asked to categorize the two unlabeled images. Image (c) is probably labeled as 1 because it contains a table: this feature is shared with the other images labeled as 1 and distinguishes it from the images labeled as 2. Similarly, image (f) is probably labeled as 2 because it contains a tree.
In this example, the images have several features, such as plants, animals, trees, and tables. However, only the presence of a tree or a table is useful for categorizing these two images correctly. We quickly attend to this key feature and neglect the others, which shows the important role of the attention mechanism.
Note that it is easy for us to extract these meaningful features because we have acquired knowledge about the world before the few-shot learning task, and we use this knowledge to express the images as accurate high-level representations. Our representation of these images stays constant, whereas we have to quickly adjust our attention and decision logic to fit the task.
Therefore, the few-shot learning process involves two main modules: a representation module that uses past knowledge to express images as meaningful high-level representations, and an attention-based decision logic module that is updated quickly to fit the new few-shot learning task.
In the above example, we clearly did not perform the few-shot learning task directly on the high-dimensional RGB images. Instead, we first expressed each image as a high-level representation in a lower-dimensional space, and then further reduced the dimension of the representation by quickly shifting our attention to select its key features. Based on the selected features, it is easy to adjust our decision logic to fit the few-shot learning task. By comparison, previous Meta learning approaches train the Meta learner to do a much harder job: the Meta learner is forced to quickly adjust its entire network to fit the few-shot learning task based on the original high-dimensional input images.
Inspired by the few-shot learning process of the human vision system, and to make the Meta learner’s job easier, we propose two methods: 1) Improving the Meta learner’s ability to learn from few data by explicitly embedding an attention mechanism into it. We call this method Attention augmented Meta learning (AML). It enables the Meta learner to pay more attention to the key features. 2) Easing the Meta learner’s work by separating its network into two modules: a representation module and an Attention Augmented Output (AAO) module. The architecture of the network is shown in Figure 1(b), and the two modules are trained separately in two stages. The representation module is pre-trained by supervised learning to acquire sufficient knowledge, and it corresponds to the representation module of human intelligence.
In the Meta learning stage, the Meta learner is trained to quickly adjust the AAO module based on the high-level representations provided by the pre-trained representation module. The AAO module thus plays the same role as the attention-based decision logic module of the human vision system. We call this method Representation based and Attention augmented Meta learning (RAML).
The main contributions of our work are:
We analyze the problems of previous Meta learning approaches and argue that both the attention mechanism and past knowledge are crucial for the Meta learner, and that the Meta learner should be trained on high-level representations of the input data instead of the original data;
Based on this viewpoint, we design two methods: Attention augmented Meta learning (AML) and Representation based and Attention augmented Meta learning (RAML);
Through extensive experiments, we attain state-of-the-art performance on several few-shot learning benchmarks with both AML and RAML, which supports our viewpoint and methods.
2 Related Work
2.1 Meta learning
An N-way, K-shot learning task provides the Meta learner with a support set and a query set. The support set contains K examples for each of the N classes, and the query set contains L examples for each of the N classes. Meta learning has been shown to be a promising way to solve the few-shot learning problem, and most Meta learning approaches train the Meta learner on N-way, K-shot learning tasks as follows: first, the Meta learner inner-updates itself according to the support set; second, the Meta learner examines the effect of the inner update by calculating its loss on the query set; third, by minimizing the loss on the query set, the Meta learner is forced to learn a good weight initializer (initial weights that can be inner-updated by simple gradient descent to perform well on the query set), or a skillful weight updater (one that accurately inner-updates the Meta learner’s weights), or both, or to memorize the information of the support set and perform well on the query set based on that memory.
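The construction of one N-way, K-shot task can be sketched as follows. This is a minimal illustration and not the authors' data pipeline: the function name `sample_episode`, the toy class and image identifiers, and the use of Python's `random` module are all assumptions made for the example.

```python
import random

def sample_episode(images_by_class, n_way, k_shot, l_query, rng):
    """Sample one N-way, K-shot task: K support and L query images per class."""
    # Pick N distinct classes; their position in the list is the episode label.
    classes = rng.sample(sorted(images_by_class), n_way)
    support, query = [], []
    for label, cls in enumerate(classes):
        # Draw K + L distinct images so support and query never overlap.
        picked = rng.sample(images_by_class[cls], k_shot + l_query)
        support += [(image, label) for image in picked[:k_shot]]
        query += [(image, label) for image in picked[k_shot:]]
    return support, query

# Hypothetical toy data: 10 classes with 20 image ids each.
toy_data = {f"class_{c}": [f"img_{c}_{i}" for i in range(20)] for c in range(10)}
support_set, query_set = sample_episode(
    toy_data, n_way=5, k_shot=1, l_query=15, rng=random.Random(0))
print(len(support_set), len(query_set))  # 5 75
```

With N = 5, K = 1 and L = 15, the support set holds 5 images and the query set holds 75, which matches the episode sizes implied by the definition above.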
2.2 Metric Learning
Some researchers have tried to solve the few-shot learning problem by metric learning. The principle of these approaches is straightforward: train a non-linear mapping function that represents images in an embedding space, so that the embeddings of images belonging to different classes are easily distinguished by a simple nearest-neighbor or linear classifier. Matching network [29] implements the mapping function with a neural network and assigns each test image to the class whose images have the most similar embeddings, where similarity is measured by the cosine distance between embeddings. Prototypical Network [26] is similar to Matching network, except that the similarity between embeddings is measured by Euclidean distance. Compared with Meta learning approaches, these metric learning based approaches have an obvious disadvantage: they are not easily applicable to other domains, such as reinforcement learning [21, 22] and regression.
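The two similarity measures can be contrasted with a small sketch. It is deliberately simplified: the embeddings are hand-written 2-D vectors rather than outputs of a trained network, the cosine variant uses a plain nearest-neighbor rule instead of Matching network's attention-weighted vote, and all function names are hypothetical.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def class_prototypes(support):
    """Mean embedding per class; support is a list of (embedding, label) pairs."""
    sums, counts = {}, {}
    for embedding, label in support:
        acc = sums.setdefault(label, [0.0] * len(embedding))
        for i, value in enumerate(embedding):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}

def classify_cosine(query, support):
    # Label of the support embedding with the highest cosine similarity.
    return max(support, key=lambda pair: cosine_similarity(query, pair[0]))[1]

def classify_prototype(query, support):
    # Label of the class mean closest to the query in Euclidean distance.
    prototypes = class_prototypes(support)
    return min(prototypes, key=lambda label: squared_euclidean(query, prototypes[label]))

# Hypothetical 2-D embeddings for a 2-way, 2-shot support set.
support = [([1.0, 0.1], 0), ([0.9, 0.0], 0), ([0.0, 1.0], 1), ([0.2, 0.8], 1)]
print(classify_cosine([0.8, 0.2], support))     # 0
print(classify_prototype([0.1, 0.9], support))  # 1
```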
3.1 Problem of learning from few data
Learning from few data is extremely difficult for deep learning based computer vision systems. This is mainly because a deep neural network has too many weights, and the input data is represented in a high-dimensional space, usually with tens or hundreds of thousands of dimensions.
For the image classification task, it is difficult for a few images of one category, lying in such a high-dimensional space, to reflect the characteristics of that category accurately. However, the human vision system can capture the characteristics of a new category from a few images by first expressing them as high-level representations and then paying attention to the key features of those representations; both steps help humans understand the characteristics of the category from few images.
Previous Meta learning approaches help computer vision systems considerably in learning from few data. However, they train the Meta learner to quickly adjust its network to fit the few-shot learning task directly on the few original high-dimensional input images, and they ignore the attention mechanism and past knowledge, both of which can help the Meta learner perform well.
In this paper, to address the problems exposed by previous Meta learning approaches, we propose two methods: Attention augmented Meta learning (AML) and Representation based and Attention augmented Meta learning (RAML).
AML aims to improve the Meta learner’s attention ability by explicitly embedding an attention model into its network; the attention model helps the Meta learner pay attention to the key features.
The network architecture of the Meta learner is shown in Figure 3. An attention model is inserted explicitly, and the forward calculation is given by Eq. 1. The feature $f$ output by the CNN is first fed into the attention model; the final feature $f_a$ is the channel-wise multiplication of the attention mask $m$ and the original feature $f$; and the classifier outputs the final prediction $p$:
$$f = F_{cnn}(x; \theta_{cnn}), \quad m = F_{att}(f; \theta_{att}), \quad f_a = F_{mul}(m, f) = m \otimes f, \quad p = F_{cls}(f_a; \theta_{cls}) \tag{1}$$
where $F_{cnn}$, $F_{att}$, $F_{mul}$, and $F_{cls}$ are the functions of the CNN, the attention model, the channel-wise multiplication, and the classification layer, respectively. Moreover, $x$ is the input data, $\theta_{cnn}$ is the weight of the CNN, $\theta_{att}$ is the weight of the attention model, and $p$ is the final prediction of the classification layer with weight $\theta_{cls}$.
By embedding the attention model explicitly, the Meta learner becomes more capable of shifting its attention to the more useful features in a few-shot learning task, which helps the classifier perform better. The corresponding experiments show the positive effect of the attention mechanism, and we analyze the distributions of the original CNN feature $f$ and the attention-weighted feature $f_a$ in the feature analysis section, where $f_a$ is clearly more distinguishable than $f$.
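The last two steps of the forward calculation, the channel-wise multiplication and the classification layer, can be illustrated with a minimal pure-Python sketch. The nested-list layout [H][W][C], the function names, and the hand-picked mask and weights are assumptions for illustration only; in the actual model the mask is produced by the attention model and the weights are learned.

```python
import math

def channelwise_multiply(feature, mask):
    """Scale every channel of an [H][W][C] feature by the matching mask entry."""
    return [[[v * m for v, m in zip(pixel, mask)] for pixel in row] for row in feature]

def softmax(logits):
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(feature, weights, bias):
    """Fully connected layer on the flattened feature, followed by softmax."""
    flat = [v for row in feature for pixel in row for v in pixel]
    logits = [sum(w * v for w, v in zip(w_row, flat)) + b
              for w_row, b in zip(weights, bias)]
    return softmax(logits)

# Hypothetical 1x1x2 feature; the mask keeps channel 0 and suppresses channel 1.
feature = [[[2.0, 4.0]]]
mask = [1.0, 0.0]
attended = channelwise_multiply(feature, mask)
print(attended)  # [[[2.0, 0.0]]]

# Hypothetical classifier weights for a 2-class output.
probs = classify(attended, weights=[[1.0, 1.0], [0.0, 1.0]], bias=[0.0, 0.0])
```

Because the mask zeroes the second channel, the classifier only sees the first one, which is the sense in which the mask steers the decision toward the key feature.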
RAML aims to give the Meta learner the ability to make good use of both previously learned knowledge and the attention mechanism. To achieve this, we separate the Meta learner’s network into two modules: a representation module and an AAO module. The representation module learns knowledge from another dataset by supervised learning, e.g. the MiniImagenet-900 dataset (a dataset we organized to pre-train the representation module; its details are introduced in the experiment section). The AAO module also embeds an attention model, and the Meta learner can adjust it efficiently to fit a new few-shot learning task.
The network structure of the Meta learner is shown in Figure 4. The Pre-trained Classification (PC) module does not belong to the Meta learner; it is only used to pre-train the representation module. The Meta learner is composed of the representation module and the AAO module, and its training is separated into two stages: a pre-training stage and a Meta training stage.
In the pre-training stage, the representation module and the PC module are trained on a classification task on the MiniImagenet-900 dataset.
After the pre-training stage, the representation module has learned knowledge from the MiniImagenet-900 dataset. Using this knowledge, the Meta learner can express the original input image as high-level representations that are suitable for differentiating image classes. In RAML, the representation module can be built from many kinds of networks; we use the ResNet-50 network in this paper.
In the Meta training stage, to keep the Meta learner from forgetting the learned knowledge, we freeze the pre-trained representation module entirely. The Meta learner only needs to learn how to solve the few-shot learning task by quickly adjusting its AAO module based on the low-dimensional, meaningful features provided by the representation module, which is a simpler job than that of the Meta learner in AML.
Note that the dataset used in the pre-training stage differs from the one used in the Meta training stage. In the Meta training stage, the Meta learner is trained on the MiniImagenet dataset, whereas in the pre-training stage, the representation module of the Meta learner is trained on the MiniImagenet-900 dataset, and the two datasets share no image classes.
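The two-stage schedule can be summarized as a small configuration sketch. The dictionary layout, the function names, and the placeholder class names are hypothetical; only the module names, the dataset names, and the no-overlap requirement come from the text.

```python
# Which modules receive gradient updates in each stage, as described above.
TRAINING_STAGES = {
    "pre-training": {"dataset": "MiniImagenet-900",
                     "trainable": {"representation", "PC"}},
    "meta-training": {"dataset": "MiniImagenet",
                      "trainable": {"AAO"}},  # representation module is frozen
}

def is_trainable(module, stage):
    return module in TRAINING_STAGES[stage]["trainable"]

def assert_disjoint_classes(pretrain_classes, meta_classes):
    """Fail loudly if the pre-training and Meta training datasets share a class."""
    overlap = set(pretrain_classes) & set(meta_classes)
    if overlap:
        raise ValueError(f"classes shared by both datasets: {sorted(overlap)}")

# Placeholder class names, not real dataset labels.
assert_disjoint_classes(["class_a", "class_b"], ["class_c", "class_d"])
print(is_trainable("representation", "meta-training"))  # False
```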
3.4 Attention model
In this paper, we use a soft attention mechanism to build the attention model; Figure 4 illustrates how the soft attention mechanism works in the Meta learner. Although the soft attention mechanism is not the same as the attention mechanism of the human vision system, it plays a similar role and helps the Meta learner pay attention to the key features. The inner structure of the attention model and the shapes of the corresponding features are shown in Figure 5, and the computation of the attention model is given by Eq. 2. The feature $f$ is first global-average-pooled to obtain the feature $f_g$, and then a convolution layer followed by a sigmoid activation layer computes the attention mask $m$ from $f_g$:
$$f_g = G(f), \quad m = \sigma(\mathrm{Conv}(f_g; \theta_{att})) \tag{2}$$
where $G$ is the global-average-pooling function, $\mathrm{Conv}$ is the convolution layer with weight $\theta_{att}$, and $\sigma$ is the sigmoid activation function. The attention model can be seen as a network that predicts the importance of each channel of the feature map and obtains the attention mask $m$ by analyzing the overall feature of the input data.
Our attention model differs from that of SENet [12]. On the one hand, we use the attention model to improve the Meta learner’s ability to quickly adjust itself to attend to the important features of the high-level representations, whereas SENet uses the attention mechanism throughout all blocks. On the other hand, the network structure of our attention model is different, and we found that SENet’s attention model does not work as well as ours on the few-shot learning problem.
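The attention mask computation described in this section (global average pooling, then a convolution, then a sigmoid) can be sketched in pure Python. This is an illustration under assumptions: the feature is a nested list laid out as [H][W][C], the convolution is taken to have a 1×1 kernel so that on the pooled 1×1×C feature it reduces to a C-to-C linear map, and the weights are hand-written rather than learned.

```python
import math

def global_average_pool(feature):
    """Average an [H][W][C] feature over its spatial positions, giving [C]."""
    h, w, c = len(feature), len(feature[0]), len(feature[0][0])
    return [sum(feature[i][j][k] for i in range(h) for j in range(w)) / (h * w)
            for k in range(c)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def attention_mask(feature, weights, bias):
    """Pool, apply the 1x1 convolution (a linear map on [C]), then squash to (0, 1)."""
    pooled = global_average_pool(feature)
    return [sigmoid(sum(w * p for w, p in zip(w_row, pooled)) + b)
            for w_row, b in zip(weights, bias)]

# Hypothetical 2x1x2 feature and identity weights.
feature = [[[1.0, 3.0]], [[3.0, 5.0]]]
print(global_average_pool(feature))  # [2.0, 4.0]
mask = attention_mask(feature, weights=[[1.0, 0.0], [0.0, 1.0]], bias=[0.0, 0.0])
```

Each mask entry lies strictly between 0 and 1, so multiplying the feature by the mask rescales channels rather than removing them outright.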
In this section, we present the results and some details of our experiments, as well as the datasets we used. More details are provided in the supplementary material. All our experimental code is based on the TensorFlow library [1].
MiniImagenet is a dataset widely used for evaluating the performance of Meta learning algorithms. It contains 100 image classes in total, including 64 training classes, 12 validation classes, and 24 testing classes. Each image class has 600 images sampled from the ImageNet dataset.
Omniglot is another dataset widely used for the few-shot learning problem. It contains 50 different alphabets with 1623 characters in total, and each character has 20 images hand-drawn by 20 different people.
The MiniImagenet-900 dataset is designed to pre-train the representation module in RAML. It is composed of 900 image classes. Each image class and its images are collected from the original ImageNet dataset, and each image class contains about 1300 images. Note that no image class in MiniImagenet-900 coincides with any class in the MiniImagenet dataset.
Places2 [35] is a dataset used for scene classification. In this paper, we also validate RAML by pre-training the representation module on the Places2 dataset.
For a fair comparison with previous few-shot learning and Meta learning methods, we resize all images of MiniImagenet, MiniImagenet-900, and Places2 to a resolution of 84×84, and all images of Omniglot to a resolution of 28×28.
4.2 Experiment on MiniImagenet
On the MiniImagenet dataset, we test both AML and RAML. Both methods work very well and attain state-of-the-art performance on this dataset.
4.2.1 AML experiment
In AML, we improve the Meta learner’s ability to learn from few data by explicitly embedding an attention mechanism into its network, and we train it with the Meta-SGD approach. The structure of the Meta learner’s network is shown in Figure 3; both the attention model and the classifier are a simple fully connected layer. We set the hyper-parameter K to 1 and 5 for the 5-way 1-shot and 5-way 5-shot tasks respectively, whereas L (the number of images per class in the query set) is always set to 15.
The experimental results of AML on MiniImagenet are shown in Tab. 2. We attain state-of-the-art performance on the 5-way 5-shot image classification task: 69.46% (compared with the original Meta-SGD, we raise the Meta learner’s performance by 8.5%).
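For readers unfamiliar with Meta-SGD, its inner update can be sketched as follows: each weight is moved along its support-set gradient with its own learned step size, instead of one shared learning rate. The sketch below only shows this inner step on a toy quadratic loss; the outer loop that learns both the initial weights and the step sizes from the query-set loss is omitted, and all names and numbers are illustrative assumptions.

```python
def meta_sgd_inner_update(theta, alpha, grads):
    """One Meta-SGD inner step: theta' = theta - alpha * grad, element-wise."""
    return [t - a * g for t, a, g in zip(theta, alpha, grads)]

def quadratic_loss_and_grad(theta, target):
    """Toy stand-in for the support-set loss: sum of squared errors."""
    loss = sum((t - y) ** 2 for t, y in zip(theta, target))
    grad = [2.0 * (t - y) for t, y in zip(theta, target)]
    return loss, grad

theta = [0.0, 0.0]    # initial weights (meta-learned in the real method)
alpha = [0.5, 0.25]   # per-weight step sizes (also meta-learned)
target = [1.0, 2.0]   # hypothetical optimum of the toy task

loss_before, grads = quadratic_loss_and_grad(theta, target)
adapted = meta_sgd_inner_update(theta, alpha, grads)
loss_after, _ = quadratic_loss_and_grad(adapted, target)
print(adapted, loss_before, loss_after)  # [1.0, 1.0] 5.0 1.0
```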
4.2.2 RAML experiment
In RAML, we give the Meta learner the ability to use both past knowledge and the attention mechanism by dividing its network into a representation module and an AAO module. These two modules are trained in different ways in two stages: the pre-training stage and the Meta training stage.
In the pre-training stage of RAML, the representation module and the PC module are trained together by supervised learning on the MiniImagenet-900 dataset. The representation module is a modified ResNet-50 network that accepts 84×84 RGB images, and the PC module is a simple fully connected layer followed by a softmax output layer. In this stage, we set the batch size to 256 and the learning rate to 0.001, decay the learning rate to 0.0001 after 30000 iterations, and use L2 regularization and dropout to prevent over-fitting. The representation module together with the PC module finally reaches 57.39% classification accuracy on the 900-way classification task on the MiniImagenet-900 dataset.
After the pre-training stage, in the Meta training stage, we fix the weights of the representation module and train the AAO module with the Meta-SGD algorithm. The hyper-parameters of the Meta training process are shown in Tab. 1, and the experimental results are shown in Tab. 2. Compared with AML, RAML improves the Meta learner’s performance even further: the accuracy on the 5-way 1-shot task rises from 52.25% to 63.66%, and the accuracy on the 5-way 5-shot task rises from 69.46% to 80.49%.
The most likely reason why RAML performs so well is as follows. Before the Meta training stage, the representation module has already learned the knowledge and the ability to understand the input image, and it provides meaningful high-level representations and features of the input image. In the Meta training stage, because the representation module is fixed, the Meta learner’s job becomes easier: it only needs to learn how to quickly adjust the AAO module according to the meaningful features provided by the representation module, and it does not need to deal with the original high-dimensional input data. In contrast, the Meta learner trained by AML has to adjust its entire network and make decisions according to the original high-dimensional input data when faced with a new few-shot learning task, which is a harder job than that of the Meta learner in RAML.
|Meta learning rate||0.001||0.001|
|total training tasks||200000||200000|
|total val/test tasks||600||600|
|Method||Venue||5-way 1-shot||5-way 5-shot|
|matching nets FCE||NIPS-16||44.20%||57.00%|
|Meta learner LSTM||ICLR-17||43.44 ± 0.77%||60.60 ± 0.71%|
|Reptile + Transduction||/||49.97 ± 0.32%||65.99 ± 0.58%|
Table 2: Few-shot learning performance on the MiniImagenet dataset. The accuracy is averaged over 600 few-shot classification tasks, with 95% confidence intervals; all 600 tasks are randomly generated from the test set of the MiniImagenet dataset. We highlight the best and second-best results on each task.
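An accuracy reported as a mean with a 95% confidence interval over 600 tasks can be computed as sketched below. The sketch assumes the common normal approximation (half-width 1.96 times the standard error); the paper does not state which formula was used, and the per-task accuracies here are synthetic placeholders, not experimental results.

```python
import math
import statistics

def mean_with_ci95(accuracies):
    """Mean accuracy and 95% confidence half-width under a normal approximation."""
    mean = statistics.mean(accuracies)
    std_error = statistics.stdev(accuracies) / math.sqrt(len(accuracies))
    return mean, 1.96 * std_error

# Synthetic placeholder: 600 per-task accuracies alternating between 0.5 and 0.7.
task_accuracies = [0.5, 0.7] * 300
mean_acc, half_width = mean_with_ci95(task_accuracies)
print(f"{100 * mean_acc:.2f} ± {100 * half_width:.2f}%")
```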
|Method||Venue||5-way 1-shot||5-way 5-shot||20-way 1-shot||20-way 5-shot|
|matching nets FCE||NIPS-16||98.10%||98.90%||93.80%||98.50%|
|Reptile + Transduction||/||97.68 ± 0.04%||99.48 ± 0.06%||89.43 ± 0.14%||97.12 ± 0.32%|
4.3 Experiment on Omniglot
For the Omniglot dataset, we test AML on the 5-way 1-shot, 5-way 5-shot, 20-way 1-shot, and 20-way 5-shot tasks. Following MAML, the network architecture used in the Omniglot experiments is similar to that used in the MiniImagenet experiments. The hyper-parameters of the Meta training process are shown in Tab. 1; the Meta batch size is set to 32 for 5-way tasks and 16 for 20-way tasks. The experimental results are shown in Tab. 3.
AML attains state-of-the-art performance on 3 of these 4 few-shot image classification tasks. In particular, on the 20-way 1-shot task, AML surpasses the other methods by a large margin (compared with the original Meta-SGD, AML improves the Meta learner’s performance from 95.93% to 98.48%).
4.4 Ablation study
In this section, we will confirm the reliability of our methods by ablation experiments.
|Method||5-way Accuracy||20-way Accuracy|
4.4.1 Ablation study of AML
First, to confirm that the attention mechanism benefits Meta learning algorithms, we run a series of experiments comparing attention-augmented Meta learners with their counterparts without attention. The experimental results are shown in Tab. 4 and Tab. 5. Methods marked with * are our own re-implementations. Our re-implemented results differ somewhat from those reported in the corresponding papers, probably because of different hyper-parameters or experimental settings. The comparison shows that in most cases the attention mechanism improves the Meta learner by a clear margin, which supports our idea and the AML method.
Furthermore, to test whether AML carries over to other Meta learning approaches, we change the Meta learning algorithm in AML from Meta-SGD to MAML, keeping the network and all hyper-parameters unchanged. We denote this variant AML(MAML). The corresponding results are shown in Tab. 6. Although the performance of AML(MAML) drops slightly, it remains comparable, which indicates that AML generalizes well to different Meta learning approaches.
Because the attention mechanism adds weights to the Meta learner’s network, we run another experiment to verify that the improvement of AML comes from the attention mechanism rather than from the increased number of weights. Since the attention model is mainly a convolution layer with a 1×1 kernel, we remove the attention model and place a convolution layer with a 1×1 kernel on top of the CNN output feature (shown in Fig. 3). We call the Meta learner with this network OML (Ordinary Meta learning); its number of weights is the same as that of AML. The corresponding results are shown in Tab. 6: OML clearly lags behind AML, which shows that the improvement of AML comes from the attention mechanism rather than from the increased number of weights.
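The weight-count parity between AML and OML follows from the fact that a 1×1 convolution has the same number of weights whether it is applied to the pooled 1×1×C feature (AML's attention model) or to the full H×W×C feature (OML). A small sketch of the arithmetic, assuming a bias term, equal input and output channel counts, and a hypothetical channel count of 64:

```python
def conv1x1_weight_count(in_channels, out_channels, use_bias=True):
    """Weights of a 1x1 convolution; independent of the spatial size of its input."""
    return in_channels * out_channels + (out_channels if use_bias else 0)

channels = 64  # hypothetical channel count, chosen only for illustration

# AML: 1x1 convolution applied to the pooled feature of shape 1 x 1 x C.
aml_attention_weights = conv1x1_weight_count(channels, channels)
# OML: 1x1 convolution applied to the full feature of shape H x W x C.
oml_conv_weights = conv1x1_weight_count(channels, channels)

print(aml_attention_weights, oml_conv_weights)  # 4160 4160
```

The two networks therefore differ in how the layer is used (a per-channel mask computed from a global summary versus a per-position linear map), not in how many weights they carry.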
4.4.2 Ablation study of RAML
Similarly to AML, we also test whether RAML generalizes to other Meta learning approaches by training the AAO module with MAML in the Meta training stage. We denote this variant RAML(MAML); the experimental results are shown in Tab. 6. RAML clearly also generalizes well to different Meta learning approaches.
We run another experiment to test how the dataset used in the pre-training stage affects the Meta learner. We pre-train the representation module on the Places2 dataset, keeping all other experimental settings and hyper-parameters the same as in the original RAML, and denote this variant RAML(Places2). The corresponding results are shown in Tab. 6. The dataset used in the pre-training stage clearly affects the Meta learner. A possible reason is that different pre-training datasets lead the representation module to learn different knowledge and features of the input data. Places2 is commonly used for scene classification, so the representation module learns knowledge about scenes, and the features it outputs are better suited to scene classification than to object classification.
4.5 Feature analysis
To understand the effect of the attention mechanism, we project the original feature $f$ and the attention-weighted feature $f_a$ (shown in Fig. 3 and Fig. 4) into a 2-dimensional space with the PCA algorithm and visualize them on a 2D plane. As shown in Fig. 6, we visualize $f$ and $f_a$ of the Meta learner trained on the 5-way 1-shot and 5-way 5-shot tasks with AML and with RAML; each picture contains 500 feature points representing 500 images of the query set. The distribution of $f_a$ is clearly more distinguishable between image classes than that of $f$: compared with $f$, the standard deviation of the intra-class distance becomes smaller and that of the inter-class distance becomes larger. The reason is simple: the attention mechanism makes the Meta learner pay more attention to the key features, and the key features affect $f_a$ more, which makes $f_a$ more distinguishable than $f$ for differentiating images of different classes.
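The intra-class and inter-class distance statistics mentioned above can be computed from the projected points as sketched below. The sketch assumes the PCA projection has already been done and starts from 2D points; the function name, the choice of pairwise Euclidean distances, and the four toy points are illustrative assumptions rather than the authors' analysis code.

```python
import itertools
import math
import statistics

def distance_stats(points, labels):
    """Mean and standard deviation of pairwise distances within and across classes."""
    intra, inter = [], []
    for (p, lp), (q, lq) in itertools.combinations(zip(points, labels), 2):
        (intra if lp == lq else inter).append(math.dist(p, q))
    return {
        "intra_mean": statistics.mean(intra),
        "intra_std": statistics.pstdev(intra),
        "inter_mean": statistics.mean(inter),
        "inter_std": statistics.pstdev(inter),
    }

# Hypothetical 2D points: two tight, well-separated classes.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
labels = [0, 0, 1, 1]
stats = distance_stats(points, labels)
print(stats["intra_mean"], stats["intra_std"])  # 1.0 0.0
```

Running the same function on the projections of the original and the attention-weighted features would give a numerical counterpart to the visual comparison.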
In this paper, aiming to improve the ability of computer vision systems to learn from few images, we analyzed the problems of previous Meta learning approaches and argued that the attention mechanism is very helpful for the Meta learner, that the Meta learner should be trained on high-level representations of the input image instead of the original high-dimensional RGB image, and that it should be able to leverage previously learned knowledge to express the input image accurately as high-level representations. Based on this viewpoint, we designed two methods: AML and RAML. Both methods work well and attain state-of-the-art performance on several few-shot learning benchmarks, which supports our viewpoint and methods.
[1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. Tensorflow: a system for large-scale machine learning. In OSDI, volume 16, pages 265–283, 2016.
[2] D. Bau, B. Zhou, A. Khosla, A. Oliva, and A. Torralba. Network dissection: Quantifying interpretability of deep visual representations. In Computer Vision and Pattern Recognition, pages 3319–3327, 2017.
[3] S. Bengio, Y. Bengio, J. Cloutier, and J. Gecsei. On the optimization of a synaptic learning rule. In Preprints Conf. Optimality in Artificial and Biological Neural Networks, pages 6–8. Univ. of Texas, 1992.
[4] Y. Bengio, S. Bengio, and J. Cloutier. Learning a synaptic learning rule. Université de Montréal, Département d’informatique et de recherche opérationnelle, 1990.
[5] J. Dai, Y. Li, K. He, and J. Sun. R-FCN: Object detection via region-based fully convolutional networks. arXiv preprint arXiv:1605.06409, 2016.
[6] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. IEEE, 2009.
[7] C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. arXiv preprint arXiv:1703.03400, 2017.
[8] V. Garcia and J. Bruna. Few-shot learning with graph neural networks. arXiv preprint arXiv:1711.04043, 2017.
[9] S. Gidaris and N. Komodakis. Dynamic few-shot visual learning without forgetting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4367–4375, 2018.
[10] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[11] S. A. Hillyard, E. K. Vogel, and S. J. Luck. Sensory gain control (amplification) as a mechanism of selective attention: electrophysiological and neuroimaging evidence. Philosophical Transactions of the Royal Society of London B: Biological Sciences, 353(1373):1257–1270, 1998.
[12] J. Hu, L. Shen, and G. Sun. Squeeze-and-excitation networks. arXiv preprint arXiv:1709.01507, 2017.
[13] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[14] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[15] B. Lake, R. Salakhutdinov, J. Gross, and J. Tenenbaum. One shot learning of simple visual concepts. In Proceedings of the Annual Meeting of the Cognitive Science Society, volume 33, 2011.
[16] Y. Li, H. Qi, J. Dai, X. Ji, and Y. Wei. Fully convolutional instance-aware semantic segmentation. In Computer Vision and Pattern Recognition, pages 4438–4446, 2017.
[17] Z. Li, F. Zhou, F. Chen, and H. Li. Meta-sgd: Learning to learn quickly for few shot learning. arXiv preprint arXiv:1707.09835, 2017.
[18] G. Logan, D. Dagenbach, and T. Carr. Inhibitory processes in attention, memory and language. Academic Press, San Diego, pages 189–239, 1994.
[19] N. Mishra, M. Rohaninejad, X. Chen, and P. Abbeel. A simple neural attentive meta-learner. 2018.
[20] V. Mnih, N. Heess, A. Graves, et al. Recurrent models of visual attention. In Advances in Neural Information Processing Systems, pages 2204–2212, 2014.
[21] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
[22] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
[23] A. Nichol, J. Achiam, and J. Schulman. On first-order meta-learning algorithms. 2018.
[24] S. Ravi and H. Larochelle. Optimization as a model for few-shot learning. 2016.
[25] J. Schmidhuber. Learning to control fast-weight memories: An alternative to dynamic recurrent networks. Neural Computation, 4(1):131–139, 1992.
[26] J. Snell, K. Swersky, and R. Zemel. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, pages 4077–4087, 2017.
[27] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi. Inception-v4, inception-resnet and the impact of residual connections on learning. In AAAI, volume 4, page 12, 2017.
[28] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
[29] O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra, et al. Matching networks for one shot learning. In Advances in Neural Information Processing Systems, pages 3630–3638, 2016.
[30] F. Wang, M. Jiang, C. Qian, S. Yang, C. Li, H. Zhang, X. Wang, and X. Tang. Residual attention network for image classification. arXiv preprint arXiv:1704.06904, 2017.
[31] F. Wang, X. Xiang, J. Cheng, and A. L. Yuille. Normface: L2 hypersphere embedding for face verification. In Proceedings of the 2017 ACM on Multimedia Conference, pages 1041–1049. ACM, 2017.
[32] M. Wang and W. Deng. Deep face recognition: A survey. arXiv preprint arXiv:1804.06655, 2018.
[33] T. Xiao, Y. Xu, K. Yang, J. Zhang, Y. Peng, and Z. Zhang. The application of two-level attention models in deep convolutional neural network for fine-grained image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 842–850, 2015.
[34] F. Sung, Y. Yang, L. Zhang, T. Xiang, P. H. Torr, and T. M. Hospedales. Learning to compare: Relation network for few-shot learning. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 2018.
[35] B. Zhou, A. Lapedriza, A. Khosla, A. Oliva, and A. Torralba. Places: A 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.