Deep learning models often overfit when training data is limited, that is, when the available data is insufficient relative to the complexity of the model. Meta-learning can be employed to alleviate this problem. Meta-learning methods randomly sample data from the training set to simulate a test scenario, which is called a task or an episode hereinafter. Each episode contains a support set and a query set: the former consists of a small number of labeled samples, while the latter consists of unlabeled samples whose labels are to be predicted. In the meta-learning paradigm, thousands of randomly constructed tasks are used to train the model so that the learned parameters capture transferable knowledge, which helps the learned model generalize to unseen tasks.
Building on meta-learning, many recently proposed few-shot classification methods [20, 17, 11, 15, 14, 18] generalize better than generic deep learning models in few-shot settings. Some of them [11, 15, 14] use distance metrics to measure the distance between the embeddings of two samples; however, they cannot make full use of the overall information of the support samples and a query sample. Another line of work treats few-shot classification as a sequence-to-sequence problem, which does not suffer from the shortcomings of the distance-based methods. Notably, SNAIL is an attentive meta-learner built on temporal convolutions and soft attention, and it is claimed to learn a more flexible strategy. However, when dealing with few-shot learning tasks, SNAIL requires the embeddings of samples to be fed into the meta-learner as a sequence in which each time step is a sample-label pair. Because the samples in a task are not spread across time, it is hard for SNAIL to learn a feasible model. It has also been reported that a model with an LSTM-based meta-learner could not be trained successfully on miniImagenet with a 4-layer feature extractor.
To overcome the problem that some metric-based few-shot learning methods cannot make full use of the overall information of the support samples and a query sample, we propose in this paper to construct a channel vector sequence that combines the information of all support samples and a query sample. In this paradigm, each sample is sent into a convolutional neural network, and a corresponding multi-channel feature map is generated. After that, each feature map is converted, channel by channel, into embedding vectors of a specific size (called channel vectors hereinafter) through a fully connected network. Finally, the channel vector sequence is formed by splicing the corresponding channel vectors of the support samples and the query sample in the original channel order. Once the channel vector sequence is generated, what remains is to extract the internal relationships among the channels and use this information to perform few-shot classification. A natural idea is to apply an LSTM or a GRU to the channel vector sequence to infer these relationships.
Experimental results show that the spliced channel vector sequence combined with an LSTM or GRU model achieves results similar to those of state-of-the-art methods in few-shot classification tasks on CUB and in the cross-domain scenario. This is because the channel vector sequence contains discriminative features that imply the relationship between the query sample and the support set samples, and the sequential relationship can also be established on the channel vector sequence through back-propagation. However, both LSTM and GRU share a disadvantage: they use the output of the previous step as the input of the current step, so neither supports parallel training.
To address the fact that LSTM and GRU cannot process the channel vector sequence in parallel, and to obtain more discriminative features, we propose a forget-update module built from forget-update blocks. It uses 1-dimensional causal dilated convolution as its base block, so it can be trained in parallel. Every forget-update block has two parts: a forget block and an update block. Fig. 1 gives the motivation of the forget-update module. Specifically, the forget block is equipped with a forget gated activation, which refines existing information, and the update block is equipped with an update gated activation, which generates new information and establishes dense connections. Experimental results demonstrate the effectiveness of the forget-update module on few-shot classification benchmarks, where it achieves state-of-the-art results.
The main contribution of this paper is twofold. First, we propose to use the channel vector sequence, which contains the overall information of the support samples and a query sample, to infer the categories of query samples. When the proposed channel vector sequence is combined with sequence prediction methods such as LSTM and GRU, the relationship between the query samples and the support samples can be inferred. Experimental results show that combining the channel vector sequence with LSTM or GRU gives results competitive with state-of-the-art few-shot learning methods on CUB and in the cross-domain scenario. Second, we design a forget-update module, a stack of forget-update blocks, which produces more discriminative features in few-shot classification tasks. Experiments show that combining the forget-update module with the channel vector sequence achieves state-of-the-art results on miniImagenet, CUB and the cross-domain scenario.
Code and datasets will be released once the article is accepted.
2 Related Work
2.1 Metric-based Few-shot Learning
A number of few-shot classification methods are metric-based, i.e., they learn a set of projection functions that project the inputs into an embedding space, together with a distance metric that measures the distance between any two embeddings. These methods aim to bring samples from the same category closer in the embedding space while keeping samples from different categories far apart. For instance, the Siamese Network extracts features from a pair of samples and calculates the similarity between the two feature vectors; classification is done by comparing the samples in the query set with those in the support set in this manner. The Prototypical Network is based on the Euclidean distance and uses the mean of the embeddings from the same category as the prototype of that category. The Relation Network is similar to the Prototypical Network, except that it employs a neural network to learn a deep metric instead of using a fixed one.
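The prototype idea described above can be illustrated with a minimal sketch. It assumes embeddings are already computed and stored as plain lists of floats; the function names `prototypes` and `classify` and the toy data are hypothetical and are not taken from any of the cited implementations.

```python
import math


def prototypes(support):
    """Map each class label to the element-wise mean of its support embeddings."""
    protos = {}
    for label, embeddings in support.items():
        dim = len(embeddings[0])
        protos[label] = [
            sum(e[d] for e in embeddings) / len(embeddings) for d in range(dim)
        ]
    return protos


def classify(query, protos):
    """Return the label whose prototype is nearest in Euclidean distance."""
    return min(protos, key=lambda label: math.dist(query, protos[label]))


# Toy 2-way 2-shot support set with 2-dimensional embeddings (illustrative values).
support = {
    "cat": [[0.0, 0.0], [0.0, 2.0]],
    "dog": [[4.0, 4.0], [6.0, 4.0]],
}
protos = prototypes(support)
prediction = classify([1.0, 1.0], protos)
```

A learned metric, as in the Relation Network, would replace `math.dist` with the output of a trained network.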
2.2 Meta-learner-based Few-shot Learning
Some other methods construct a meta-learner that learns to update the parameters of a traditional learner designed for scenarios with a large amount of data. One approach initializes the parameters of the traditional learner so that a few gradient descent steps with a small amount of training data from a new task lead to good generalization performance on that task. Another approach uses the embedding vectors of newly seen samples to imprint weights for the new classes at the end of the base network. The traditional learner used in [qiao2018few] is a convolutional network, and the method proposed in [qiao2018few] learns to update the parameters of the last fully-connected layer of the base network based on the newly seen samples.
2.3 Time Series Prediction Methods
Aside from recurrent neural networks (RNNs), many recently proposed methods have been shown to work well on time series tasks, and some of them even outperform RNNs. Vaswani et al. proposed a network architecture called the Transformer to solve sequence-to-sequence problems. The Transformer is built on the attention mechanism instead of RNNs, which makes it more parallelizable and faster to train. More recently, a 1-dimensional convolutional neural network architecture named TCN was proposed by Bai et al. and has shown its effectiveness in sequence modeling problems. By combining causal convolution, dilated convolution, and shortcut connections, TCN is able to make use of information from a long history.
3 Proposed Approach
In this section, after defining the problem, we describe the channel vector sequence construction module, the causal dilated convolution block, the forget-update module and the prediction module used in our method one by one in detail. The overall framework of the proposed method is shown in Fig. 2.
3.1 Problem Definition
We first give the general setup and notation of few-shot classification used in this paper. The purpose of few-shot classification is to build a model that can classify unlabeled samples with the help of a few labeled samples. Each few-shot classification task (or episode) contains a labeled support set $S$, an unlabeled query set $Q$ and an output set $Y$, where the elements of $S$ and $Q$ do not intersect. The output set $Y$ is the set of the labels of all the samples in $Q$. We only consider the $N$-way $K$-shot few-shot classification paradigm in this paper. In such a paradigm, every support set $S$ contains exactly $N$ classes and each class has $K$ samples, while the query set $Q$ contains unlabeled samples that belong to the classes in $S$.
Here, $N$ is the number of classes, $K$ is the number of samples of each class in the support set $S$, and $M$ is the size of the query set $Q$. The subscripts of $x_{i,j}$ indicate that it is the $j$-th sample in the $i$-th support class.
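One way to write the three sets described above is sketched below. The symbols ($S$, $Q$, $Y$, $N$, $K$, $M$, $x_{i,j}$, $y_{i,j}$, $\tilde{x}_m$, $\tilde{y}_m$) are notation assumed here for illustration and may differ from the original equations.

```latex
\begin{aligned}
S &= \{(x_{i,j},\, y_{i,j}) \mid i = 1,\dots,N;\ j = 1,\dots,K\},\\
Q &= \{\tilde{x}_1, \dots, \tilde{x}_M\},\\
Y &= \{\tilde{y}_1, \dots, \tilde{y}_M\},
\end{aligned}
```

Under this notation, $\tilde{y}_m$ is the label of the query sample $\tilde{x}_m$, and no $\tilde{x}_m$ appears among the $x_{i,j}$.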
As in other state-of-the-art few-shot classification methods [11, 20, 15, 14, 18], our method trains a meta-learner to fit the few-shot classification task through minimizing the loss of its predictions over the query set as in Equation 4.
Here, $f_\theta$ denotes the meta-learner and $\mathcal{L}$ is the loss function. We train the meta-learner $f_\theta$ on thousands of randomly sampled tasks under the loss function $\mathcal{L}$.
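A generic form of such an episodic training objective, written under assumed notation ($f_\theta$ for the meta-learner with parameters $\theta$, $\mathcal{L}$ for the loss, and $p(\mathcal{T})$ for the distribution from which tasks $(S, Q, Y)$ are sampled), might look as follows. It is a sketch consistent with the description above, not necessarily the exact equation of the paper.

```latex
\theta^{*} = \arg\min_{\theta}\;
\mathbb{E}_{(S,\,Q,\,Y)\sim p(\mathcal{T})}
\Big[\, \mathcal{L}\big(f_{\theta}(S, Q),\, Y\big) \Big]
```

In practice the expectation is approximated by averaging the loss over the randomly sampled training episodes.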
3.2 Channel Vector Sequence Construction Module
For each task, we first extract the feature map of each sample in the support set $S$ and the query set $Q$. When dealing with $K$-shot tasks where $K > 1$, for each class we calculate an element-wise average over the feature maps of all the samples in that class to form a class-level feature map. The class-level feature map for the $i$-th class can be formulated by Equation 5.
Here, the feature extractor accepts either grayscale or RGB images as input. It is composed of a 4-layer 2-dimensional convolution block followed by a dimension reduction block. The output of the convolution block is a feature map consisting of $C$ channels. The dimension reduction block then squeezes the information in each channel into a $d$-dimensional vector using two consecutive fully-connected layers. The values of $C$ and $d$ are chosen according to experiments.
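The element-wise averaging that produces a class-level feature map can be sketched as follows. The sketch assumes that each reduced feature map is stored as a list of channel vectors (one list of floats per channel); the function name `class_level_feature_map` and the toy values are hypothetical.

```python
def class_level_feature_map(feature_maps):
    """Element-wise mean over the feature maps of one support class.

    feature_maps: list of K feature maps, each a list of C channel vectors,
    each channel vector a list of d floats.
    """
    k = len(feature_maps)
    channels = len(feature_maps[0])
    dim = len(feature_maps[0][0])
    return [
        [sum(fm[c][t] for fm in feature_maps) / k for t in range(dim)]
        for c in range(channels)
    ]


# Two samples of one class, each with 2 channels of dimension 2 (toy values).
sample_a = [[1.0, 2.0], [3.0, 4.0]]
sample_b = [[3.0, 4.0], [5.0, 6.0]]
class_map = class_level_feature_map([sample_a, sample_b])
```

With a single sample per class, the class-level feature map is simply that sample's feature map.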
After that, we perform channel-level stitching between all class-level feature maps of the support set and the feature map of each image in the query set. Equation 6 shows how the channels are stitched.
Here, the channel concatenation function splices its inputs according to the order of channels, $C$ is the number of channels, and $d$ is the feature dimension of each sample. The result is called the channel vector sequence in this paper.
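One plausible reading of the channel-level stitching is sketched below: for every channel index, the corresponding channel vector of each class-level feature map is concatenated, followed by the query sample's channel vector. The exact layout of the original equation is not recoverable from the text, so the resulting shape (one sequence step per channel, each step holding the vectors of all classes plus the query) is an assumption, as is the function name `channel_vector_sequence`.

```python
def channel_vector_sequence(class_maps, query_map):
    """Build one sequence step per channel.

    class_maps: list of class-level feature maps (each: C channel vectors).
    query_map: feature map of a single query sample (C channel vectors).
    Step c concatenates channel c of every class map, then channel c of the query.
    """
    sequence = []
    for c in range(len(query_map)):
        step = []
        for class_map in class_maps:
            step.extend(class_map[c])
        step.extend(query_map[c])
        sequence.append(step)
    return sequence


# Toy 2-way task: 2 channels, 2 values per channel vector.
class_maps = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[5.0, 6.0], [7.0, 8.0]],
]
query_map = [[9.0, 10.0], [11.0, 12.0]]
sequence = channel_vector_sequence(class_maps, query_map)
```

In this layout the sequence axis follows the original channel order, which is what allows a sequence model to be applied across channels.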
Then, we transform the few-shot classification problem into a sequence prediction problem on the channel vector sequence. We formalize the prediction model as Equation 7.
Here, $\tilde{x}_j$ indicates the $j$-th element of the query set $Q$, $c_k$ is the $k$-th channel of the channel vector sequence, $G$ is a sequence prediction model, and $\tilde{y}_j$ is the label of $\tilde{x}_j$.
3.3 Causal Dilated Convolution Block
The causal dilated convolution blocks are the basis of the proposed method and are applied to the channel vector sequence. A causal convolution produces an output of the same length as its input, and each output element depends only on the inputs at or before the current position. In addition, dilated convolution is adopted to enlarge the receptive field over the channel vector sequence. The dilation factor increases exponentially with depth and can be formalized as Equation 8.
Here, $k$ is the kernel size and $l$ indicates the $l$-th layer of the forget-update module. The details of the forget-update module are described in Section 3.4.
With the help of causal dilated convolution blocks, a very large receptive field can be covered with only a few layers. Causal dilated convolution was first used for generating raw audio in WaveNet.
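The following sketch shows a scalar causal dilated convolution and how the receptive field grows when dilations increase exponentially. The dilation schedule used in the example (powers of 2) is an assumption chosen for illustration, since the exact form of Equation 8 is not recoverable here; `causal_dilated_conv1d` and `receptive_field` are hypothetical helper names.

```python
def causal_dilated_conv1d(x, weights, dilation):
    """y[t] = sum_j weights[j] * x[t - j * dilation], with zeros before the start.

    The output has the same length as x and y[t] never reads x[t + 1] or later.
    """
    out = []
    for t in range(len(x)):
        acc = 0.0
        for j, w in enumerate(weights):
            idx = t - j * dilation
            if idx >= 0:  # positions before the sequence start count as zero
                acc += w * x[idx]
        out.append(acc)
    return out


def receptive_field(kernel_size, dilations):
    """Number of input positions visible to one output of the stacked layers."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = causal_dilated_conv1d(x, weights=[1.0, 1.0], dilation=2)

# Four layers with kernel size 2 and assumed dilations 1, 2, 4, 8.
field = receptive_field(kernel_size=2, dilations=[1, 2, 4, 8])
```

With these assumed settings, four layers already cover 16 sequence positions, whereas four undilated layers with the same kernel size would cover only 5.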
3.4 Forget-Update Module
The forget-update module consists of stacked forget-update blocks. The details of the forget-update module are described in Algorithm 1, and the forget-update block is illustrated in Fig. 3. The forget block is designed to refine existing information, and the update block is designed to produce new information and establish dense connections. The forget block generates data of the same size as its input and can be formalized as Equation 9.
Here, $\mathrm{Conv}$ is the causal dilated convolution function, $\sigma$ is the sigmoid function, $r$ is the dilation rate, $k$ is the kernel size, and $x_i$ is the input to the $i$-th forget-update block in the forget-update module.
The update block generates data with the same sequence length as its input, and the dimension of each channel vector is set to the filter size of the block. The channel vector sequence is used as the input of the first forget-update block in the proposed model, and the output of each forget-update block is used as the input of the next one. The update block can be formalized as Equations 10, 11 and 12.
Here, $\tanh$ is the hyperbolic tangent activation function.
Finally, we concatenate the output of the forget block and the output of the update block along the feature dimension to form the output of the forget-update block.
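The structure of one forget-update block can be sketched as below. Because the original gating equations are not recoverable from the text, the specific forms used here are assumptions: the forget block multiplies its input by a sigmoid gate computed with a causal dilated convolution, and the update block uses a tanh filter multiplied by a sigmoid gate (in the style of gated activations used with causal convolutions). All function names and weights are hypothetical.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def causal_conv(seq, taps, dilation):
    """seq: T vectors of size D_in. taps: list of (D_out x D_in) matrices;
    tap j reads the input j * dilation steps in the past (zeros before the start)."""
    d_out = len(taps[0])
    out = []
    for t in range(len(seq)):
        acc = [0.0] * d_out
        for j, mat in enumerate(taps):
            idx = t - j * dilation
            if idx < 0:
                continue
            for o, row in enumerate(mat):
                acc[o] += sum(w * v for w, v in zip(row, seq[idx]))
        out.append(acc)
    return out


def forget_block(seq, taps, dilation):
    """Assumed form: rescale existing features with a learned sigmoid gate."""
    gate = causal_conv(seq, taps, dilation)  # same feature size as the input
    return [[x * sigmoid(g) for x, g in zip(xs, gs)] for xs, gs in zip(seq, gate)]


def update_block(seq, filter_taps, gate_taps, dilation):
    """Assumed form: new features from a tanh filter times a sigmoid gate."""
    f = causal_conv(seq, filter_taps, dilation)
    g = causal_conv(seq, gate_taps, dilation)
    return [[math.tanh(a) * sigmoid(b) for a, b in zip(fs, gs)] for fs, gs in zip(f, g)]


def forget_update_block(seq, forget_taps, filter_taps, gate_taps, dilation):
    kept = forget_block(seq, forget_taps, dilation)
    new = update_block(seq, filter_taps, gate_taps, dilation)
    # Concatenate along the feature dimension (dense-connection style growth).
    return [k + n for k, n in zip(kept, new)]


seq = [[1.0, 2.0], [0.0, -1.0], [3.0, 0.5]]           # T = 3 steps, 2 features
zeros_2x2 = [[0.0, 0.0], [0.0, 0.0]]
forget_taps = [zeros_2x2, zeros_2x2]                   # gate is 0.5 everywhere
filter_taps = [[[1.0, 0.0]], [[0.0, 0.0]]]             # one new feature per step
gate_taps = [[[0.0, 0.0]], [[0.0, 0.0]]]
out = forget_update_block(seq, forget_taps, filter_taps, gate_taps, dilation=1)
```

Each block in this sketch keeps the sequence length and grows the feature size by the number of update filters, so stacking blocks accumulates features in a dense-connection manner.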
3.5 Prediction Module
The prediction module is a three-layer fully connected network that uses ReLU as the activation function. The weights of the prediction module are initialized following prior work, and weight normalization is applied. The prediction module predicts values that represent the similarity between the query sample’s feature map and the class-level feature maps of the support classes.
4 Experiments and Discussion
In this section, we evaluate the proposed method and some state-of-the-art few-shot classification methods in three scenarios, including generic object recognition, fine-grained classification, and cross-domain scenario classification.
4.1 Datasets and Scenarios
For the scenario of generic object recognition, we use the widely used miniImagenet dataset. miniImagenet is a subset of ImageNet and consists of 100 classes, each of which contains 600 images with a size of 84×84. We follow the split used in previous work and divide the dataset into 64, 16 and 20 classes for training, validation and testing, respectively.
For the scenario of fine-grained classification, we use the CUB dataset . It contains 11,788 images from 200 classes in total. In our experiments, we split the dataset into 100, 50 and 50 classes for training, validation and testing, respectively.
For the cross-domain scenario (miniImagenet → CUB; for simplicity, we call it cross hereinafter), we follow an existing setting, which uses miniImagenet as the training set while using the 50 CUB validation classes for validation and the 50 CUB testing classes for testing. This tests the performance of our method when the effect of domain shift is relatively significant.
4.2 Implementation Details
The episodic meta-learning training setup for few-shot classification has been described in earlier work, and all the methods in this paper are trained with this meta-learning strategy. Specifically, each prediction in a task (or episode) relies only on the corresponding support set. In the meta-training process, we train for 60,000 episodes when using CUB or cross as the dataset, and for 120,000 episodes with miniImagenet. We adopt the $N$-way $K$-shot few-shot classification paradigm. In each episode, we randomly choose $N$ classes and then randomly sample $K$ images for each chosen class to make up the support set. The size of the query set is fixed to 16. 1-shot and 5-shot classification are evaluated on the miniImagenet dataset, while 1-shot, 3-shot and 5-shot classification are evaluated on the CUB and cross datasets. During meta-training, the model with the best accuracy on the validation set is saved, and the saved model is used to evaluate the test accuracy. In the meta-testing process, we test the model on 600 episodes and report the average over all prediction results as the testing accuracy.
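Episode construction as described above can be sketched as follows. The sketch assumes the dataset is a mapping from class label to a list of image identifiers, and that query samples are drawn from the remaining images of the chosen classes; the latter is an assumption, as is every name in the code (`sample_episode`, `data_by_class`, the toy identifiers).

```python
import random


def sample_episode(data_by_class, n_way, k_shot, n_query, rng):
    """Sample one N-way K-shot episode with a fixed-size query set."""
    classes = rng.sample(sorted(data_by_class), n_way)
    support = {}
    pool = []
    for label in classes:
        items = list(data_by_class[label])
        rng.shuffle(items)
        support[label] = items[:k_shot]
        # Remaining images of the chosen classes are query candidates.
        pool.extend((item, label) for item in items[k_shot:])
    query = rng.sample(pool, n_query)
    return support, query


# Toy dataset: 10 classes with 20 image identifiers each (hypothetical names).
data_by_class = {
    f"class{c}": [f"img{c}_{i}" for i in range(20)] for c in range(10)
}
rng = random.Random(0)
support, query = sample_episode(data_by_class, n_way=5, k_shot=1, n_query=16, rng=rng)
```

Repeating this sampling for each training step yields the stream of episodes used for meta-training, and the same routine over the test classes yields the evaluation episodes.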
In this paper, all the methods are trained from scratch with Adam as the optimizer. The optimizer uses an initial learning rate of 0.001, and we reduce the learning rate by 10% when the testing accuracy stagnates for 7 consecutive training steps. For each convolutional layer in the feature extractor, a batch normalization module is inserted between the convolution and the activation function. All the convolutions use 64 kernels and the activation function is ReLU. A max-pooling operation is added to the first two layers. We only apply normalization, scaling and center-cropping to the input images, without data augmentation. We reimplemented MatchingNet, ProtoNet and RelationNet on CUB and cross.
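The learning rate schedule can be sketched as a small plateau-based rule. The sketch reads "reduce by 10%" as multiplying the learning rate by 0.9 and "stagnates for 7 consecutive steps" as 7 consecutive evaluations without a new best accuracy; both readings are assumptions, and the class name `PlateauDecay` is hypothetical.

```python
class PlateauDecay:
    """Multiply the learning rate by `factor` after `patience` stale evaluations."""

    def __init__(self, lr=0.001, factor=0.9, patience=7):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.best = float("-inf")
        self.stale = 0

    def step(self, accuracy):
        """Record one evaluation and return the learning rate to use next."""
        if accuracy > self.best:
            self.best = accuracy
            self.stale = 0
        else:
            self.stale += 1
            if self.stale >= self.patience:
                self.lr *= self.factor
                self.stale = 0
        return self.lr


schedule = PlateauDecay()
schedule.step(0.50)                       # first evaluation sets the best accuracy
rates = [schedule.step(0.50) for _ in range(7)]   # seven evaluations without improvement
```

In this sketch the rate stays at 0.001 for the first six stale evaluations and drops to 0.0009 on the seventh.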
The proposed model contains two forget-update modules. Every forget-update module contains () forget-update blocks. The filter_size of all forget-update blocks in the first and second forget-update modules is set to 16 and 32, respectively.
4.3 Experiment Results
Our experiments are designed to answer the following questions:
How does the proposed method compare to existing methods?
What are the effects of the forget block and the update block?
To validate the effectiveness of the proposed method, we compare it with MatchingNet, ProtoNet and RelationNet on the miniImagenet and CUB datasets and in the cross-domain scenario. In addition, we also compare with MAML and SNAIL on miniImagenet.
Tables 1, 2 and 3 report the object recognition accuracy on miniImagenet, CUB and the cross-domain scenario, respectively. All the methods use a 4-layer convolutional network as backbone. Table 1 shows that the proposed method achieves the best performance in the 5-way 1-shot setting and the second-best performance in the 5-way 5-shot setting, where ProtoNet clearly outperforms the proposed method. This is most likely because ProtoNet is trained in a 20-way 5-shot setting and tested in a 5-way 5-shot setting; as a result, it gains greater discrimination on the 5-way 5-shot object recognition task. Table 2 shows that the proposed method improves considerably on both the 5-way 3-shot and 5-way 5-shot tasks and achieves second place on the 5-way 1-shot task. Table 3 shows that the proposed method improves on the compared methods on the 5-way 1-shot and 5-way 3-shot tasks, but does not achieve the best result on the 5-way 5-shot task.
Table 4 shows the results of the ablation experiments. TCN is the baseline method, which uses causal dilated convolution with identity connections and a channel configuration similar to that of the proposed method. The +update method uses the proposed update block but no forget block. The proposed method uses the full forget-update module with the channel vector sequence as input. Table 4 shows that both the forget block and the update block improve the prediction performance on miniImagenet, CUB and cross.
Table 5 shows the prediction results of the GRU and LSTM methods with the proposed channel vector sequence as input. For both LSTM and GRU, the hidden layer dimension is set to 512, the number of layers is set to 2, and bidirectional mode is disabled. Table 5 shows that most results of the LSTM and GRU methods are better than those of the state-of-the-art few-shot classification methods on CUB and in the cross-domain scenario, and that LSTM and GRU are competitive with MatchingNet on miniImagenet. This indicates that the proposed channel vector sequence can be combined with time sequence models to infer the similarity relationship between the query sample and the support set samples. It also shows that the proposed method outperforms LSTM and GRU in almost all settings, which suggests that the proposed forget-update module is better suited to the proposed channel vector sequence than LSTM and GRU.
From the experimental results, we find that the pipeline that uses the channel vector sequence as input and the forget-update module as relation discriminator achieves state-of-the-art results for few-shot classification. We think that the update block is equivalent to a dense connection mechanism, which generates new information each time the sequence passes through an update block, while the forget block modifies existing information with learned weights. The fact that combining LSTM or GRU with the proposed channel vector sequence also achieves state-of-the-art results on CUB and in the cross-domain scenario indicates that the similarity relationship between the class-level features and the query sample’s features is implied in the proposed channel vector sequence.
In this paper, we study the spliced channel vector sequence of the support samples and a query sample. Experimental results show that using the spliced channel vector sequence as the input of LSTM and GRU gives results competitive with state-of-the-art few-shot classification methods on the CUB dataset and in the cross-domain scenario (miniImagenet → CUB). This shows that sequence prediction methods can be used to infer the similarity relationship with the help of the channel vector sequence. We also propose a forget-update module, which shows promising results on few-shot classification tasks when using the proposed channel vector sequence as input.
-  Diederik P. Kingma and Jimmy Ba: Adam: A Method for Stochastic Optimization. International Conference on Learning Representations (2015)
-  Wah, Catherine and Branson, Steve and Welinder, Peter and Perona, Pietro and Belongie, Serge: The caltech-ucsd birds-200-2011 dataset. California Institute of Technology (2011)
-  Russakovsky, Olga and Deng, Jia and Su, Hao and Krause, Jonathan and Satheesh, Sanjeev and Ma, Sean and Huang, Zhiheng and Karpathy, Andrej and Khosla, Aditya and Bernstein, Michael and others: Imagenet large scale visual recognition challenge. International journal of computer vision 115(3), 211–252 (2015)
-  Bengio, Samy and Bengio, Yoshua and Cloutier, Jocelyn and Gecsei, Jan: On the optimization of a synaptic learning rule. Preprints Conf. Optimality in Artificial and Biological Neural Networks 6–8 (1992)
-  Andrychowicz, Marcin and Denil, Misha and Gomez, Sergio and Hoffman, Matthew W and Pfau, David and Schaul, Tom and Shillingford, Brendan and De Freitas, Nando: Learning to learn by gradient descent by gradient descent. Advances in neural information processing systems 3981–3989 (2016)
-  Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N and Kaiser, Łukasz and Polosukhin, Illia: Attention is all you need. Advances in neural information processing systems 5998–6008(2017)
-  Wei-Yu Chen and Yen-Cheng Liu and Zsolt Kira and Yu-Chiang Frank Wang and Jia-Bin Huang: A Closer Look at Few-shot Classification. International Conference on Learning Representations (2019)
-  Dai, Zihang and Yang, Zhilin and Yang, Yiming and Cohen, William W and Carbonell, Jaime and Le, Quoc V and Salakhutdinov, Ruslan: Transformer-xl: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860 (2019)
-  Cho, Kyunghyun and Van Merriënboer, Bart and Gulcehre, Caglar and Bahdanau, Dzmitry and Bougares, Fethi and Schwenk, Holger and Bengio, Yoshua: Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078 (2014)
-  Kanazawa, Angjoo and Zhang, Jason Y and Felsen, Panna and Malik, Jitendra: Learning 3d human dynamics from video. IEEE Conference on Computer Vision and Pattern Recognition 5614–5623 (2019)
-  Munkhdalai, Tsendsuren and Yu, Hong: Meta networks. Proceedings of the 34th International Conference on Machine Learning 70, 2554–2563 (2017)
-  Hochreiter, Sepp and Schmidhuber, Jürgen: Long short-term memory. Neural computation 9(8), 1735–1780 (1997)
-  Sachin Ravi and Hugo Larochelle: Optimization as a Model for Few-Shot Learning. International Conference on Learning Representations (2017)
-  Flood Sung,Yongxin Yang,Li Zhang,Tao Xiang,Philip H. S. Torr,Timothy M. Hospedales: Learning to Compare: Relation Network for Few-Shot Learning. IEEE Conference on Computer Vision and Pattern Recognition 1199–1208 (2018)
-  Jake Snell,Kevin Swersky,Richard S. Zemel: Prototypical Networks for Few-shot Learning. Advances in Neural Information Processing Systems 4077–4087 (2017)
-  Hang Qi,Matthew Brown,David G. Lowe: Low-Shot Learning With Imprinted Weights. IEEE Conference on Computer Vision and Pattern Recognition 5822–5830 (2018)
-  Koch, Gregory and Zemel, Richard and Salakhutdinov, Ruslan: Siamese neural networks for one-shot image recognition. ICML deep learning workshop (2015)
-  Koch, Gregory and Zemel, Richard and Salakhutdinov, Ruslan: Siamese neural networks for one-shot image recognition. ICML deep learning workshop (2015)
-  Shaojie Bai,J. Zico Kolter,Vladlen Koltun: An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling. CoRR (2018)
-  Chelsea Finn,Pieter Abbeel,Sergey Levine: Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. International Conference on Machine Learning 1126–1135(2017)
-  Aäron van den Oord,Sander Dieleman,Heiga Zen,Karen Simonyan,Oriol Vinyals,Alex Graves,Nal Kalchbrenner,Andrew W. Senior,Koray Kavukcuoglu: WaveNet: A Generative Model for Raw Audio. CoRR (2016)
-  Wenling Shang,Kihyuk Sohn,Diogo Almeida,Honglak Lee: Understanding and Improving Convolutional Neural Networks via Concatenated Rectified Linear Units. International Conference on Machine Learning 2217–2225 (2016)
-  Wenling Shang,Kihyuk Sohn,Diogo Almeida,Honglak Lee: Understanding and Improving Convolutional Neural Networks via Concatenated Rectified Linear Units. International Conference on Machine Learning 2217–2225 (2016)
-  Tim Salimans,Diederik P. Kingma: Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks. Advances in Neural Information Processing Systems (2016)
-  Kaiming He,Xiangyu Zhang,Shaoqing Ren,Jian Sun: Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. IEEE International Conference on Computer Vision 1026–1034 (2015)
-  Victor Garcia Satorras,Joan Bruna Estrach: Few-Shot Learning with Graph Neural Networks. International Conference on Learning Representations (2018)
-  Alex Nichol,Joshua Achiam,John Schulman: On First-Order Meta-Learning Algorithms. CoRR (2018)
-  Rumelhart, David E and Hinton, Geoffrey E and Williams, Ronald J: Learning representations by back-propagating errors. nature 323(6088), 533–536 (1986)
-  Kaiming He,Xiangyu Zhang,Shaoqing Ren and Jian Sun: Deep Residual Learning for Image Recognition. IEEE Conference on Computer Vision and Pattern Recognition 770–778 (2016)
-  Ross B. Girshick,Jeff Donahue,Trevor Darrell and Jitendra Malik: Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. IEEE Conference on Computer Vision and Pattern Recognition 580–587 (2014)
-  Joseph Redmon, Santosh Kumar Divvala, Ross B. Girshick and Ali Farhadi: You Only Look Once: Unified, Real-Time Object Detection. IEEE Conference on Computer Vision and Pattern Recognition 779–788 (2016)
-  Olaf Ronneberger: Invited Talk: U-Net Convolutional Networks for Biomedical Image Segmentation. Bildverarbeitung für die Medizin 2017 - Algorithmen - Systeme - Anwendungen. Proceedings des Workshops vom 12. bis 14. März 2017 in Heidelberg 3 (2017)
-  Fei-Fei, Li and Fergus, Rob and Perona, Pietro: One-shot learning of object categories. IEEE transactions on pattern analysis and machine intelligence 28(4), 594–611 (2006)
-  Lake, Brenden M and Salakhutdinov, Ruslan and Tenenbaum, Joshua B: Human-level concept learning through probabilistic program induction. Science 350(6266), 1332–1338 (2015)
-  Sergey Ioffe, Christian Szegedy: Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. International Conference on Machine Learning 448–456 (2015)
-  Gao Huang,Zhuang Liu,Laurens van der Maaten,Kilian Q. Weinberger: Densely Connected Convolutional Networks. IEEE Conference on Computer Vision and Pattern Recognition 2261–2269 (2017)