Deep learning based artificial intelligence has shown remarkable progress in many computer vision tasks [16, 31, 6, 22, 11, 2]. However, it still falls far behind human intelligence in learning from a few images. Few-shot image classification (FSIC) [40, 32, 29, 8, 21], which aims to enable deep models to recognize unseen categories by learning from a few images of these categories, has become an increasingly popular problem.
One typical line of FSIC methods is to train a non-linear mapping function that embeds images into an embedding space. After training, the embeddings of images belonging to different classes are easy to distinguish by classifying them with nearest-neighbor [38, 33] or linear classifiers.
Recently, meta-learning based FSIC methods [13, 3, 28, 4, 30, 8, 27, 21, 7, 1, 17, 9, 37] have become more popular. Instead of training a model directly on images, they commonly train a meta-learner on FSIC tasks so that the meta-learner learns universal, easily fine-tunable initial weights for FSIC tasks [8, 24, 13, 23]. In each FSIC task, the meta-learner is required to recognize new image categories by fine-tuning itself on a few images of these categories. Typically, an FSIC task is called an N-way K-shot classification task, and it is constructed with a support set, on which learners (ordinary deep models or meta-learners) fine-tune, and a query set for evaluation. N-way means the task contains N unseen categories for the learner to recognize, and K-shot means the support set contains K samples of each category for the learner to fine-tune on.
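As a concrete illustration, an N-way K-shot task can be sampled as follows (a minimal sketch; the `dataset` dict mapping class names to lists of image identifiers is a hypothetical stand-in for a real image dataset):

```python
import random

def sample_episode(dataset, n_way=5, k_shot=1, q_queries=15):
    # Pick N classes, then split K support and Q query images per class.
    # Episode labels are re-indexed to 0..N-1 within the task.
    classes = random.sample(sorted(dataset), n_way)
    support, query = [], []
    for label, cls in enumerate(classes):
        imgs = random.sample(dataset[cls], k_shot + q_queries)
        support += [(img, label) for img in imgs[:k_shot]]
        query += [(img, label) for img in imgs[k_shot:]]
    return support, query
```

A learner fine-tunes on `support` and is then evaluated on `query`.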
In this paper, we discover an interesting phenomenon of these meta-learning based methods. As shown in Fig.1, when fine-tuning on the support set, meta-learners pay more attention to updating their top layer than an ordinary deep model does. For example, the update proportions of the top (5-th) layer of MAML and TAML are approximately 73% and 83%, while that of the deep model is 55%. For ease of understanding, we formulate the update proportion as
$p_i = \frac{\|\theta_i^n - \theta_i^0\|_2}{\sum_{j=1}^{L} \|\theta_j^n - \theta_j^0\|_2}$    (1)

where $p_i$ denotes the update proportion of the $i$-th layer, $\theta_i^0$ and $\theta_i^n$ represent the initial weight and the updated weight after $n$ (10 in this work) update steps of the $i$-th layer, respectively, $\|\cdot\|_2$ is the L2 norm, and $L$ denotes the number of layers. Note that, in this paper, "update" means that the meta-learner fine-tunes its weights on the support set of an FSIC task.
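Eq.1 can be computed directly from the per-layer weights before and after fine-tuning; a minimal numpy sketch (layer weights represented as plain arrays) is:

```python
import numpy as np

def update_proportions(initial_weights, updated_weights):
    """Eq.1: per-layer L2 norm of the weight change, normalized by
    the sum of the norms over all layers."""
    deltas = [np.linalg.norm(w_n - w_0)   # ||theta_i^n - theta_i^0||_2
              for w_0, w_n in zip(initial_weights, updated_weights)]
    total = sum(deltas)
    return [d / total for d in deltas]
```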
Inspired by the above phenomenon, we conduct a study on meta-learning based FSIC and make three contributions to the few-shot image classification community.
1) We assume that in the FSIC scene, the meta-learner may greatly prefer updating its top layer to updating its bottom layers for better performance. In other words, a better-suited layer-wise updating rule may be favored by the meta-learner. To validate this assumption and improve the meta-learner's FSIC performance, we design a novel layer-wise adaptive updating (LWAU) method (code is available at https://github.com/qyxqyx/LWAU). Manually designing the better-suited layer-wise updating rule is inefficient and expensive, so in LWAU, we train the meta-learner to learn not only the easily fine-tunable weights but also its favored layer-wise adaptive updating rule.
2) Extensive experiments conducted on two FSIC benchmarks (i.e., Miniimagenet and Tieredimagenet) validate the assumption and LWAU. As shown in Fig.1, when the LWAU meta-learner updates itself on the support set of FSIC tasks, it almost neglects its bottom layers and pays almost all of its attention to updating its top layer. Meanwhile, the proposed LWAU apparently outperforms the other few-shot classification methods on both Miniimagenet and Tieredimagenet, which shows the effectiveness of the layer-wise updating rule. Besides, we visualize the learned sparse image representations of LWAU for a better understanding of it.
3) We show that when tested on FSIC tasks, the proposed LWAU meta-learner can accelerate its update by updating only its top layers without performance decline. For example, compared with traditional meta-learning based methods, which need to update all layers, LWAU can speed up the update by 5 and 10 times on 1-shot and 5-shot tasks on Miniimagenet, respectively.
In this section, we detail the proposed layer-wise adaptive updating (LWAU) method. For a fair comparison with the other few-shot classification methods, the network used in LWAU is the same as that used in the compared methods. We call the network Conv-4 and show its structure in Fig.2. It consists of five layers: four cascaded convolution layers and one fully-connected layer. On an FSIC task $\tau$, we train the meta-learner with the following three stages.
First, the meta-learner updates its weights on the support set to recognize the categories in task $\tau$. Note that each layer updates with an exclusive learning-rate rather than a global learning-rate shared with the other layers. For clarity, we show only one update step of LWAU, which can be formulated as Eq.2 and Eq.3:

$\theta_i' = \theta_i - \alpha_i \nabla_{\theta_i} \mathcal{L}_{S_\tau}(f_\theta)$    (2)

$\mathcal{L}_{S_\tau}(f_\theta) = \frac{1}{|S_\tau|} \sum_{(x, y) \in S_\tau} \ell(f_\theta(x), y)$    (3)

$\theta = (\theta_1, \dots, \theta_5)$ is the weight vector of the five layers (i.e., $\theta_i$ is the weight of the $i$-th layer, where $i \in [1, 5]$). $\alpha = (\alpha_1, \dots, \alpha_5)$ is a vector, and each $\alpha_i$ is a trainable scalar denoting the exclusive updating learning-rate of $\theta_i$. $S_\tau$ is the support set of the task $\tau$, and $(x, y)$ is a pair of instance and label belonging to the support set of $\tau$. $\ell$ is the cross-entropy classification loss function, $\mathcal{L}_{S_\tau}(f_\theta)$ is the meta-learner's loss on the support set, and $f_\theta(x)$ is the prediction of the meta-learner. $|S_\tau|$ denotes the number of instances in the support set. With the update on the support set, each layer's weight $\theta_i$ turns to $\theta_i'$ with its exclusive updating learning-rate $\alpha_i$.
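The layer-wise update of Eq.2 and Eq.3 can be sketched with a tiny two-layer classifier standing in for Conv-4 (an illustrative numpy re-implementation with manual gradients, not the paper's code; `alphas[i]` plays the role of the per-layer learning rate):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def forward(W1, W2, x):
    h = np.maximum(0.0, x @ W1)   # ReLU hidden layer ("bottom layer")
    return h, softmax(h @ W2)     # class probabilities ("top layer")

def inner_update(W1, W2, alphas, x, y, steps=5):
    # One fine-tuning loop on the support set (x, one-hot labels y):
    # layer i is updated with its own learning-rate alphas[i].
    n = x.shape[0]
    for _ in range(steps):
        h, p = forward(W1, W2, x)
        d_logits = (p - y) / n            # cross-entropy gradient w.r.t. logits
        gW2 = h.T @ d_logits              # top-layer gradient
        d_h = d_logits @ W2.T
        d_h[h <= 0] = 0.0                 # ReLU backward
        gW1 = x.T @ d_h                   # bottom-layer gradient
        W1 = W1 - alphas[0] * gW1
        W2 = W2 - alphas[1] * gW2
    return W1, W2
```

Setting `alphas[0] = 0` freezes the bottom layer entirely, which mirrors the near-zero bottom-layer learning rates that LWAU ends up learning.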
Secondly, the meta-learner with its updated weights $\theta'$ is evaluated on the query set, which can be formulated as Eq.4:

$\mathcal{L}_{Q_\tau}(f_{\theta'}) = \frac{1}{|Q_\tau|} \sum_{(x, y) \in Q_\tau} \ell(f_{\theta'}(x), y)$    (4)

$Q_\tau$ is the query set of the task $\tau$, $|Q_\tau|$ is the number of instances in the query set, and $\mathcal{L}_{Q_\tau}(f_{\theta'})$ is the meta-learner's loss on the query set. Note that when evaluated on the query set, the meta-learner predicts the instances with the updated weights $\theta'$.
Finally, in order for the meta-learner to learn both the easily fine-tunable weights $\theta$ and its preferred layer-wise adaptive updating rule $\alpha$, so that it can fine-tune itself precisely on the support set to recognize the categories of the task, both $\theta$ and $\alpha$ are meta-trained. Thus, the training of LWAU is

$(\theta, \alpha) \leftarrow (\theta, \alpha) - \beta \nabla_{(\theta, \alpha)} \mathcal{L}_{Q_\tau}(f_{\theta'})$    (5)

where $\beta$ is the meta learning-rate. Note that Eq.5 uses the meta-learner's loss on the query set to compute the gradients of $\theta$ and $\alpha$, but not $\theta'$.
By training the meta-learner on a large number of FSIC tasks, the meta-learner is forced to learn: 1) easily fine-tunable initial weights $\theta$ for solving FSIC tasks, and 2) proper layer-wise adaptive updating learning-rates $\alpha$ that benefit the meta-learner's learning from a few images. With the learned $\theta$ and $\alpha$, the meta-learner learns on the support set more precisely than other meta-learners that learn only the weights $\theta$. Algorithm 1 summarizes the training procedure of LWAU.
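The three stages above can be sketched end-to-end in PyTorch (an illustrative fragment with a plain linear stack standing in for Conv-4; the function names are ours, and the second-order gradients of Eq.5 flow through the inner loop via `create_graph=True`):

```python
import torch
import torch.nn.functional as F

def net(weights, x):
    # A stack of linear layers with ReLU, standing in for Conv-4.
    for w in weights[:-1]:
        x = torch.relu(x @ w)
    return x @ weights[-1]

def adapt(weights, alphas, x_s, y_s, steps=5):
    # Stage 1 (Eq.2-3): each layer i updates with its own trainable
    # learning-rate alphas[i] on the support-set loss.
    for _ in range(steps):
        loss = F.cross_entropy(net(weights, x_s), y_s)
        grads = torch.autograd.grad(loss, weights, create_graph=True)
        weights = [w - a * g for w, a, g in zip(weights, alphas, grads)]
    return weights

def meta_step(weights, alphas, x_s, y_s, x_q, y_q, opt):
    # Stages 2-3 (Eq.4-5): the query loss of the *updated* weights is
    # backpropagated into both the initial weights and the alphas.
    updated = adapt(weights, alphas, x_s, y_s)
    q_loss = F.cross_entropy(net(updated, x_q), y_q)
    opt.zero_grad()
    q_loss.backward()
    opt.step()
    return q_loss.item()
```

Because `alphas` sit inside the inner update, the query loss carries a gradient signal into them, which is exactly how LWAU learns its layer-wise rule.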
Miniimagenet is a subset sampled from ImageNet. It contains 100 image classes: 64 for training, 16 for validation, and 20 for testing. Each class in Miniimagenet is composed of 600 images, and each image is resized to 84x84 resolution.
Tieredimagenet is another subset sampled from ImageNet. Different from Miniimagenet, whose training and testing sets contain similar image categories (e.g., "pipe organ" in the training set and "electric guitar" in the testing set), Tieredimagenet hierarchically structures the image classes so that all classes in the testing set are distinct from all classes in the training set. It contains 34 high-level image classes, including 20, 6, and 8 classes for training, validation, and testing, respectively. Each high-level class is composed of 10 to 30 low-level classes, and each low-level class consists of about 1300 images. As with Miniimagenet, all images in Tieredimagenet are resized to 84x84 resolution.
3.2 Experiment on Miniimagenet
We use Conv-4 as the network of the meta-learner and set the number of filters of each convolution layer to 32. Each convolution layer is followed by a batch-normalization and a ReLU operator. On the training set, we generate 200,000 5-way $K$-shot classification training tasks, and on each of the validation and testing sets, we generate 600 5-way $K$-shot validation or testing tasks. $K$ is set to 1 or 5, and in each task, the query set contains 15 samples for each way. We train the meta-learner on the training tasks for 60,000 iterations, with the meta batch-size set to 4 and the meta learning-rate set to 0.001. The optimizer we use is Adam. All the updating learning-rates in the vector $\alpha$ are initialized to 0.01. On each training task, the meta-learner updates itself on the support set for 5 steps, and on each testing task, it updates for 10 steps.
Experimental results on Miniimagenet are shown in Tab.1. Note that, for a fair comparison, all compared methods shown in Tab.1 use Conv-4 as their network. The proposed LWAU apparently outperforms the other methods. For example, compared with MAML and LLAMA, LWAU improves the 5-way 1-shot performance by about 2.5% and 1.1%, respectively.
(Tab.1 excerpt: Matching nets FCE achieves 44.20% on 5-way 1-shot and 57.00% on 5-way 5-shot.)
3.3 Experiment on Tieredimagenet
As Tieredimagenet is a larger dataset than Miniimagenet, we set the number of filters of each convolution layer to 64 and generate 400,000 training tasks on the training set. We train the meta-learner for 120,000 iterations. L1 regularization with a coefficient of 0.001 is applied to prevent the meta-learner from over-fitting, and the meta learning-rate is multiplied by 0.5 every 10,000 iterations. All the other experimental settings are the same as those on Miniimagenet.
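The learning-rate schedule above can be written as a small helper (a sketch; the constants follow the numbers stated in the text):

```python
def meta_lr(iteration, base_lr=0.001, decay=0.5, every=10_000):
    # Step decay: multiply the meta learning-rate by 0.5 every 10,000 iterations.
    return base_lr * decay ** (iteration // every)
```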
Tab.2 shows the experimental results on Tieredimagenet. Compared with MAML and TAML, LWAU improves the 5-way 1-shot performance by about 1.4% and 0.8%, respectively. The experiments on both Miniimagenet and Tieredimagenet demonstrate the advantage of the proposed LWAU on the FSIC problem.
3.4 Visualization and Analysis
Update Proportion. To validate the effect of the layer-wise adaptive updating rule, we visualize the update proportion of each layer in Fig.1 and Fig.3. The calculation of each layer's update proportion is given in Eq.1. Compared with the deep model and the other meta-learners, the proposed LWAU meta-learner learns to pay much more attention to updating its top layer, especially on 1-shot learning tasks. For example, when solving 5-way 1-shot testing tasks on Miniimagenet and Tieredimagenet, the LWAU meta-learner almost ignores its bottom layers and devotes almost all of its update to the top layer (i.e., the update proportion of the 5-th layer is nearly 100%). This visualization supports our assumption that in the FSIC scene, the meta-learner greatly prefers updating its top layer to updating its bottom layers.
Learning Curve. We visualize the learning curves of $\alpha$ and the accuracy of LWAU in Fig.4. The learning curves are drawn when LWAU is trained on 5-way 1-shot FSIC tasks on Miniimagenet. It is clear that $\alpha_4$ and $\alpha_5$ rise as the meta-learner is trained, while $\alpha_1$, $\alpha_2$, and $\alpha_3$ stay close to zero throughout training. LWAU achieves its maximum accuracy of 49.93% at around the 37,000-th iteration.
Image Representation. For a better understanding of LWAU, we visualize its learned image representation in Fig.5. The representation, which is fed to the fully-connected layer that classifies the input image, is a vector of length 800; we reshape it into a representation map of 20x40 resolution. The representations of MAML and the deep model are also shown for comparison, and all representations are normalized to have a maximum value of one. From Fig.5, we can see that LWAU extracts the sparsest image representation, while the deep model extracts the densest one.
We quantify the representation sparsity with the neuron activation percentage. The neuron activation percentage of LWAU's representation is about 24.3%, while those of MAML and the deep model are about 30.8% and 63.1%, respectively. Note that an activated neuron is a neuron with a non-zero response to the input image. A large body of work [34, 35, 20, 39] has demonstrated that sparse representations benefit image classification, and many methods [36, 19, 18] improve image classification by utilizing sparse representations. Fig.5 clearly shows that, compared with MAML, the proposed LWAU extracts a sparser representation, which might be the reason why LWAU outperforms MAML.
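The neuron activation percentage used above can be computed directly (a minimal sketch; `representation` is any flattened feature vector, such as the 800-dimensional one in Fig.5):

```python
import numpy as np

def activation_percentage(representation):
    # Percentage of neurons with a non-zero response to the input image;
    # a lower value means a sparser representation.
    rep = np.asarray(representation, dtype=float)
    return float((rep != 0).mean() * 100.0)
```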
3.5 Update Efficiency
Fig.1 and Fig.3 show that the LWAU meta-learner pays little attention to updating its bottom layers. This indicates that the bottom layers' update might contribute little to LWAU's few-shot classification performance. In other words, it might be possible to accelerate the LWAU meta-learner's update without performance decline by updating only its top layers. To verify this point, we conduct an experiment on Miniimagenet in which, when tested on an FSIC task, the meta-learner updates only its top layers on the support set and freezes its bottom layers.
The experimental results are shown in Fig.6. Each number $n$ on the x-axis denotes that the meta-learner's bottom $n$ layers are frozen when it updates on the support set. When $n=0$, the meta-learner updates all its layers when testing, and we treat its performance at $n=0$ as its baseline. Fig.6 shows that freezing the bottom layers hardly affects the LWAU meta-learner but greatly affects MAML. For example, when $n \leq 3$, the LWAU meta-learner's performance is approximately equivalent to its baseline, whereas when $n > 0$, the MAML meta-learner performs apparently worse than its baseline. This experiment reveals a notable advantage of the LWAU meta-learner: it needs to update only its top layers, which can significantly accelerate its update. When $n=3$, the meta-learner's update costs about 6.3 and 11 on 5-way 1-shot and 5-shot tasks, respectively, while the baseline costs 35 and 120 (we evaluate the meta-learner's update time on one GTX1060 GPU). In conclusion, when tested on an FSIC task, the proposed LWAU can improve its efficiency of learning from a few images by at least 5 times by updating only its top 2 layers, without sacrificing its FSIC performance.
In this paper, we propose a novel meta-learning based layer-wise adaptive updating (LWAU) method for few-shot image classification. Different from other meta-learning based methods, which commonly train a meta-learner to learn only easily fine-tunable weights, LWAU trains a meta-learner to learn not only the easily fine-tunable weights but also a layer-wise adaptive updating rule to improve its learning from a few images. Extensive experiments show that, compared with the other meta-learning based methods, the proposed LWAU achieves not only better few-shot image classification performance but also higher fine-tuning efficiency on a few images. Besides, the visualization of the extracted image representations shows that LWAU extracts sparse representations, which benefits the understanding of LWAU.
-  A. Achille, M. Lam, R. Tewari, A. Ravichandran, S. Maji, C. C. Fowlkes, S. Soatto, and P. Perona. Task2vec: Task embedding for meta-learning. In The IEEE International Conference on Computer Vision (ICCV), October 2019.
-  D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
-  S. Bengio, Y. Bengio, J. Cloutier, and J. Gecsei. On the optimization of a synaptic learning rule. In Preprints Conf. Optimality in Artificial and Biological Neural Networks, pages 6–8. Univ. of Texas, 1992.
-  G. Denevi, D. Stamos, C. Ciliberto, and M. Pontil. Online-within-online meta-learning. In Advances in Neural Information Processing Systems 32, pages 13110–13120. Curran Associates, Inc., 2019.
-  J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. IEEE, 2009.
-  Y. Dong, J. Feng, L. Liang, L. Zheng, and Q. Wu. Multiscale sampling based texture image classification. IEEE Signal Processing Letters, pages 614–618, 2017.
-  Y. Duan, J. Schulman, X. Chen, P. L. Bartlett, I. Sutskever, and P. Abbeel. RL2: Fast reinforcement learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779, 2016.
-  C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, volume abs/1703.03400, 2017.
-  S. Gidaris and N. Komodakis. Dynamic few-shot visual learning without forgetting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4367–4375, 2018.
-  E. Grant, C. Finn, S. Levine, T. Darrell, and T. Griffiths. Recasting gradient-based meta-learning as hierarchical bayes. International Conference on Learning Representations, 2018.
-  G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In CVPR, volume 1, page 3, 2017.
-  S. Huang, J. Lin, and L. Huangfu. Class-prototype discriminative network for generalized zero-shot learning. IEEE Signal Processing Letters, pages 1–1, 2020.
-  M. A. Jamal and G.-J. Qi. Task agnostic meta-learning for few-shot learning. IEEE Conference on Computer Vision and Pattern Recognition, pages 11719–11727, 2019.
-  H. Jiang, R. Wang, S. Shan, and X. Chen. Adaptive metric learning for zero-shot recognition. IEEE Signal Processing Letters, pages 1–1, 2019.
-  D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. international conference on learning representations, 2015.
-  A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
-  K. Lee, S. Maji, A. Ravichandran, and S. Soatto. Meta-learning with differentiable convex optimization. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
-  C. Li, J. Guo, and H. Zhang. Local sparse representation based classification. In 2010 20th International Conference on Pattern Recognition, pages 649–652, 2010.
-  C.-Y. Lu, H. Min, J. Gui, L. Zhu, and Y.-K. Lei. Face recognition via weighted sparse representation. J. Visual Communication and Image Representation, pages 111–116, 2013.
-  J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman. Discriminative learned dictionaries for local image analysis. CVPR, 2008.
-  N. Mishra, M. Rohaninejad, X. Chen, and P. Abbeel. A simple neural attentive meta-learner. In International Conference on Learning Representations, 2018.
-  Y. Pang, M. Sun, X. Jiang, and X. Li. Convolution in convolution for network in network. IEEE transactions on neural networks and learning systems, 29(5):1587–1597, 2018.
-  Y. Qin, C. Zhao, X. Zhu, Z. Wang, Z. Yu, T. Fu, F. Zhou, J. Shi, and Z. Lei. Learning meta model for zero- and few-shot face anti-spoofing. Association for Advancement of Artificial Intelligence (AAAI), 2020.
-  A. Rajeswaran, C. Finn, S. Kakade, and S. Levine. Meta-learning with implicit gradients. Annual Conference on Neural Information Processing Systems, pages 113–124, 2019.
-  S. Ravi and H. Larochelle. Optimization as a model for few-shot learning. 2016.
-  M. Ren, E. Triantafillou, S. Ravi, J. Snell, K. Swersky, J. B. Tenenbaum, H. Larochelle, and R. S. Zemel. Meta-learning for semi-supervised few-shot classification. International Conference on Learning Representations, 2018.
-  A. Santoro, S. Bartunov, M. Botvinick, D. Wierstra, and T. P. Lillicrap. Meta-learning with memory-augmented neural networks. In International Conference on Machine Learning, pages 1842–1850, 2016.
-  J. Schmidhuber. Learning to control fast-weight memories: An alternative to dynamic recurrent networks. Neural Computation, 4(1):131–139, 1992.
-  J. Snell, K. Swersky, and R. Zemel. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, pages 4077–4087, 2017.
-  Q. Sun, Y. Liu, T.-S. Chua, and B. Schiele. Meta-transfer learning for few-shot learning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
-  C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi. Inception-v4, inception-resnet and the impact of residual connections on learning. In AAAI, volume 4, page 12, 2017.
-  O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra, et al. Matching networks for one shot learning. In Advances in Neural Information Processing Systems, pages 3630–3638, 2016.
-  L. Wang, B. Yang, Y. Chen, X. Zhang, and J. Orchard. Improving neural-network classifiers using nearest neighbor partitioning. IEEE transactions on neural networks and learning systems, 28(10):2255–2267, 2017.
-  J. Wright, Y. Ma, J. Mairal, G. Sapiro, S. T. Huang, and S. Yan. Sparse representation for computer vision and pattern recognition. Proceedings of the IEEE, pages 1031–1044, 2010.
-  J. Wright, Y. A. Yang, A. Ganesh, S. S. Sastry, and Y. Ma. Robust face recognition via sparse representation. IEEE Trans. Pattern Anal. Mach. Intell., pages 210–227, 2009.
-  L. Zhang, W. Zhou, P. Chang, J. Liu, Z. Yan, T. Wang, and F. Li. Kernel sparse representation-based classifier. IEEE Transactions on Signal Processing, 60(4):1684–1695, 2012.
-  R. Zhang, T. Che, Z. Ghahramani, Y. Bengio, and Y. Song. Metagan: An adversarial approach to few-shot learning. In Advances in Neural Information Processing Systems, pages 2367–2376, 2018.
-  S. Zhang, X. Li, M. Zong, X. Zhu, and R. Wang. Efficient knn classification with different numbers of nearest neighbors. IEEE transactions on neural networks and learning systems, 29(5):1774–1785, 2018.
-  S. Zhang, H. Wang, and W. Huang. Two-stage plant species recognition by local mean clustering and weighted sparse representation classification. Cluster computing, 20(2):1517–1525, 2017.
-  W. Zhang, Q. Yin, W. Wang, and F. Gao. One-shot blind cfo and channel estimation for ofdm with multi-antenna receiver. IEEE Transactions on Signal Processing, 62(15):3799–3808, 2014.