Deep learning (DL) is already ubiquitous in our daily lives, with applications including image-based object detection liu2020deep; wani2020supervised and medical imaging and healthcare wang2020medical. While DL is outperforming traditional machine learning methods in these application areas raschka2020machine, a major downside of DL is that it requires large amounts of data to achieve good performance lecun2015deep. Few-shot learning (FSL) is a subfield of DL that focuses on training DL models under scarce data regimes, thereby opening possibilities for applying DL to new problem areas where the amount of labeled data is limited.
In FSL settings, datasets are composed of large numbers of categories (i.e., class labels), but only a few images per class are available. The main objective of FSL is the design of methods that achieve good generalization performance from the limited number of images per category. The overarching concept of FSL is very general and applies to different data modalities. However, most FSL research is focused on image classification wang2019generalizing, so we will use the terms examples and images (in a supervised learning context) interchangeably.
Most FSL methods use an episodic training strategy known as meta-learning vinyals2016matching, where a meta-learner is trained on (classification) tasks with the goal of learning to perform well on new, unseen tasks. While many of the most recent FSL methods are based on episodic meta-learning snell2017prototypical; sung2018learning; finn2017model; Ravi2017, another successful approach to FSL is the use of transfer learning, where models are trained on large datasets and then appropriately transferred to smaller datasets that contain the novel target classes qi2018low; gidaris2018dynamic; qiao2018few.
Apart from recent developments in FSL, many researchers have recently proposed methods for implementing graph neural networks to extend deep learning approaches for graph-structured data. In this context, graphs are used as data structures for modeling the relationships (edges) between data instances (nodes) scarselli2008graph; kipf2017semi; velivckovic2018graph; gilmer2017neural; duvenaud2015convolutional. Since FSL methods are centered around modeling relationships between the examples in the support and query datasets, graph neural networks have also gained a growing interest in FSL research Garcia2018; Liu2019; kim2019edge. Graph neural networks can be computationally prohibitive on large datasets. However, we shall note that one of the significant characteristics of FSL is that datasets for meta-training and meta-testing contain only "few" examples per class, such that the computational cost of graph construction becomes small in FSL.
Previous research has shown that FSL can be improved by incorporating additional information. For instance, unlabeled data ren2018meta; li2019learning; yu2019transmatch and additional modalities (e.g., textual information describing the images to be classified) xing2019adaptive; schonfeld2019generalized could improve the predictive performance of FSL models. While the aforementioned works showed that additional external information benefits FSL, we raise the question of whether additional internal information can be useful as well.
While the incorporation of additional information can be beneficial, the utilization of additional internal information is not very common in FSL research, and only two recent research papers explored this approach li2019revisiting; lifchitz2019dense. In these works, the researchers expanded the feature embedding vectors of the data inputs (i.e., images), obtained from the last layer in the neural network, to higher-dimensional embeddings. These higher-dimensional embeddings were split into several smaller vectors, such that multiple embedding vectors correspond to the same image. In the DN4 model proposed by Li et al. li2019revisiting, the last layer’s feature embeddings were expanded to form many local descriptors. The dense classification network by Lifchitz et al. lifchitz2019dense expanded the feature embeddings to three separate vectors that are used for computing the cross-entropy loss during training.
When it comes to utilizing additional internal information, both DN4 li2019revisiting and the dense classification network lifchitz2019dense only considered the last layer’s information. In contrast to existing work on FSL, we consider additional information that is hidden in the earlier layers of the neural network. We hypothesize that such internal information benefits an FSL model’s predictive performance. More specifically, the extra information hidden in the network considered in this work is comprised of the feature embeddings that can be obtained from layers before the last layer. We propose using a graph structure to integrate this lower-level information into the neural network.
We refer to the FSL method proposed in this paper as Looking-Back, because unlike DN4 li2019revisiting and the dense classification network lifchitz2019dense, this method is looking back at lower-level information rather than focusing on the final layer’s feature embeddings alone. During training, the lower-level information is expected to help the meta-learner to absorb more information overall. Although this lower-level information may not be as useful as the embedding vectors obtained from the last layer, we hypothesize that the lower-level information has a positive impact on the meta-learner. To test this hypothesis, we experiment with the popular Conv-64F model li2019revisiting as a backbone, and we follow the TPN method Liu2019 for graph construction and label propagation.
Besides the feature embeddings of the last layer, the previous layers’ feature embeddings (i.e., lower-level information) are also used for computing the pair-wise similarities between the inputs, based on relational network structures, which differs from the original TPN implementation Liu2019. In the Looking-Back method, three groups of pair-wise similarity measures are computed. The similarity scores between all support and query images in one episode yield three separate graph Laplacians, which are used for iterative label propagation to generate three separate cross-entropy losses. The losses from lower-level features are used during meta-training to enhance the performance of the meta-learner. After meta-training, we adopt the last layer’s feature embeddings for testing on new tasks (i.e., images with class labels that are not seen during training) in a transductive fashion. As the experimental results reveal, the resulting FSL models have a better predictive performance on new, unseen tasks compared to models generated by meta-learners that do not utilize lower-level information.
The contributions of this work can be summarized as follows:
We propose an FSL meta-learner, Looking-Back, that utilizes lower-level information from hidden layers, which is different from existing FSL methods that only use feature embedding of the last layer during meta-training.
We implement our Looking-Back method using a graph neural network, which fully utilizes the advantage of graph structures for few-shot learning to absorb the lower-level information in the hidden layers of the neural network.
We evaluate our proposed Looking-Back method on two popular FSL datasets, miniImageNet and tieredImageNet, and achieve new state-of-the-art results, providing supporting evidence that using lower-level information could result in better meta-learners in FSL tasks.
2 Related Work
In this section, we discuss the recent developments in FSL with a focus on methods related to our work. We group these related FSL methods into two main categories, meta-learning-based approaches and transfer learning-based approaches.
FSL, based on meta-learning, typically uses episodic training strategies. In each episode, the meta-learner is trained on a meta-task, which can be thought of as an image classification task. During training, these tasks are drawn randomly from the training dataset across the episodes. During the model evaluation, tasks are chosen from a separate test dataset, which consists of images from novel classes that are not contained in the training dataset.
In $N$-way $K$-shot FSL, when a meta-learner is trained on several tasks sampled from the training dataset, each training task is subdivided into a support set and a query set. Each task consists of $N$ unique class labels, and the support set consists of $K$ labeled images per class. Utilizing the support set, the model learns to predict the image labels in the query set. After training, the meta-learner is then evaluated on new tasks sampled from the test set. Similar to the training tasks, each new task consists of $N$ unique class labels with $K$ images (in the support set) each. However, to assess how well the meta-learner performs on new tasks, the classes in the test dataset do not overlap with the classes in the training set.
Based on the general FSL meta-learning framework described above, we can divide meta-learning approaches further into metric-, optimization-, and graph-based meta-learning, which we discuss in the following subsections.
2.1.1 Metric-based Meta-learning
Metric-based methods are primarily focused on learning feature embeddings that enable similarity comparisons between support and query images. The Prototypical Network snell2017prototypical used a Euclidean distance measure to compare the feature embeddings of the query images with centroids of the support images in different classes. The Relation Network sung2018learning constructed an additional network to compute the similarity score between images directly, instead of using the Euclidean distance measure on the images’ feature embeddings as in the Prototypical Network. DN4 li2019revisiting used a cosine similarity measure on multiple local descriptors, obtained by expanding the feature embeddings of the last layer to higher dimensions, to find the most similar images via nearest neighbor search.
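To make the centroid comparison used by metric-based methods concrete, the following is a minimal sketch of nearest-centroid classification with a squared Euclidean distance. It is only an illustration in the spirit of the Prototypical Network: the function name `prototype_predict` and the toy two-dimensional embeddings are assumptions made for this example, and no embedding network is trained here.

```python
import numpy as np

def prototype_predict(support_emb, support_labels, query_emb, n_classes):
    """Assign each query embedding to the class with the nearest centroid."""
    # One centroid (prototype) per class: mean of that class's support embeddings.
    prototypes = np.stack([
        support_emb[support_labels == c].mean(axis=0) for c in range(n_classes)
    ])
    # Squared Euclidean distance between every query and every prototype.
    diff = query_emb[:, None, :] - prototypes[None, :, :]
    dists = (diff ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

# Toy 2-way 2-shot episode with 2-D embeddings (illustrative values only).
support_emb = np.array([[0.0, 0.0], [0.2, 0.0], [1.0, 1.0], [1.0, 0.8]])
support_labels = np.array([0, 0, 1, 1])
query_emb = np.array([[0.1, 0.1], [0.9, 1.0]])
predictions = prototype_predict(support_emb, support_labels, query_emb, n_classes=2)
```

In this toy episode, each query lies close to one of the two class centroids, so the predicted labels follow the cluster structure of the support set.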
2.1.2 Optimization-based Meta-learning
Optimization-based methods are focused on parameter optimization and how to rapidly learn knowledge from limited training images that can be adapted to novel images. The model agnostic meta-learning framework (MAML) finn2017model learned a general model that can be efficiently fine-tuned to perform well on other tasks using conventional gradient descent-based optimization. While MAML used second-order partial derivatives to train the general model before task-specific fine-tuning, Reptile nichol2018first was a first-order approximation of MAML that simplified the training procedure and boosted computational performance. Ravi and Larochelle Ravi2017 introduced a related yet different approach to optimization-based meta-learning. They proposed the use of an LSTM to model the sequence corresponding to the sequential optimization of the model parameters across different tasks.
2.1.3 Graph-based Meta-learning
Graph-based meta-learning uses graph structures to model the relationship between query and support images based on relative similarity measures, where each labeled and unlabeled image represents a node in the graph. There are very few treatments of graph-based methods for FSL in the literature; however, the topic has recently gained more attention in the FSL research community.
In 2017, Garcia and Bruna Garcia2018 proposed the use of a graph neural network (GNN) for aggregating node information in an iterative fashion via a message-passing model, where the support and query images are densely connected in the graph. The edge-labeling graph neural network (EGNN) modified this approach, using edge- rather than node-label information, combined with inter-cluster dissimilarity and intra-cluster similarity measures kim2019edge. Like GNN, the Transductive Propagation Network (TPN) considered the graph nodes for representing the feature embeddings of the images Liu2019. However, instead of performing inductive inference (that is, predicting test images one by one), TPN used transductive inference to predict the labels of the entire test set at once, which alleviated the low-data problem in FSL and achieved state-of-the-art performance Liu2019.
2.2 Transfer Learning
In contrast to meta-learning, transfer learning is based on a more conventional supervised learning approach. Here, a model is pre-trained on a large dataset with an abundant number of examples per class. After pre-training on these base classes, the model is then transferred (i.e., fine-tuned) to the novel classes in a few-shot task.
The weight imprinting method qi2018low constructed classifiers for novel tasks by imprinting the centroids of the novel images’ feature embeddings on classifier weights. TransMatch yu2019transmatch extended this concept to semi-supervised settings. The dynamic few-shot object recognition system proposed by Gidaris and Komodakis introduced an attention module during training to learn the classifier weights gidaris2018dynamic. The dense classification network was another method based on imprinting lifchitz2019dense. In addition, this method expanded the feature embeddings obtained from the last layer to a set of vectors when computing the cross-entropy loss during training on the base classes. All cross-entropy loss terms were aggregated to compute the overall loss during backpropagation.
The Looking-Back method we propose in this paper (Figure 1) uses the same graph construction approach as TPN Liu2019. However, Looking-Back incorporates the feature embeddings from hidden layers in the graph construction procedure as well. We shall note that the simultaneous training with graphs built on lower-level information could also be seen as a particular case of multi-task learning or incremental learning, which was mentioned in mallya2018packnet but is rarely adopted in FSL.
3 Proposed Method
In this section, we introduce our proposed Looking-Back approach utilizing lower-level information to enhance the predictive performance of FSL models.
3.1 Problem Definition
The goal of FSL is to train predictive models that learn from and perform well on classification tasks, given only a few labeled examples per class. For instance, $N$-way $K$-shot classification can be understood as a classification task with $N$ unique classes, where $K$ labeled examples per class are provided for supervised learning.
In an $N$-way $K$-shot setting, the dataset for a given task is divided into a support set $\mathcal{S}$ and a query set $\mathcal{Q}$. $\mathcal{S}$ consists of $N \times K$ examples $x_1, \dots, x_{N \times K}$ and the corresponding class labels $y_1, \dots, y_{N \times K}$. The goal is to utilize $\mathcal{S}$ to predict the class labels for the $T$ examples in $\mathcal{Q}$, $x_{N \times K + 1}, \dots, x_{N \times K + T}$.
Given a large training dataset $D_{train}$, with base classes $C_{base}$, FSL meta-learning approaches sample many different $N$-way $K$-shot classification tasks randomly from $D_{train}$, to train the meta-learner for $E$ episodes. After training, the meta-learner is given a novel $N$-way $K$-shot classification task $\mathcal{T}_{novel}$, such that the classes $C_{novel}$ do not overlap with the base classes in $D_{train}$ encountered during training. The dataset corresponding to $\mathcal{T}_{novel}$ is split into support and query sets, and the meta-learner uses the labeled examples in the support set to classify the examples in the query set.
A successful FSL meta-learner learns from the training tasks how to efficiently utilize the few labeled examples in the support set of a novel task so that the resulting model is able to predict the class labels in the unlabeled query set with good generalization performance.
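The episodic setup described above can be sketched in a few lines. The snippet below is a minimal illustration, not the sampling code used in the experiments: the helper `sample_episode`, the dictionary-based toy dataset, and the choice of three query examples per class are assumptions made for this example.

```python
import random

def sample_episode(dataset, n_way, k_shot, n_query, rng):
    """Sample one N-way K-shot episode from a {class_name: [examples]} mapping."""
    # Pick N distinct classes for this task.
    classes = rng.sample(sorted(dataset), n_way)
    support, query = [], []
    for label, cls in enumerate(classes):
        # Draw K support and n_query query examples without overlap.
        examples = rng.sample(dataset[cls], k_shot + n_query)
        support += [(x, label) for x in examples[:k_shot]]
        query += [(x, label) for x in examples[k_shot:]]
    return support, query

# Toy dataset: 10 classes with 20 example ids each (hypothetical data).
toy = {f"class_{c}": [f"img_{c}_{i}" for i in range(20)] for c in range(10)}
support, query = sample_episode(toy, n_way=5, k_shot=1, n_query=3,
                                rng=random.Random(0))
```

Class labels are re-indexed from 0 to N-1 within each episode, which reflects that the meta-learner has to rely on the support set rather than on fixed class identities.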
Considering the general problem definition of FSL and meta-learning given above, the examples in the query set can be used in a transductive manner as suggested by Liu2019. That is, instead of classifying the query examples one at a time, the whole query set can be propagated into the network all at once, which improves the predictive performance compared to classifying each query example independently Liu2019.
3.2 Feature Extractor Module
The two predominant types of neural network backbone architectures used in FSL research are ResNet-12 mishra2017simple; oreshkin2018tadam; lee2019meta; sun2019meta and Conv-64F vinyals2016matching; snell2017prototypical; sung2018learning; Garcia2018; li2019revisiting; Liu2019. In this work, we adopt Conv-64F since it is easier to experiment with. However, we shall note that our proposed method is architecture-agnostic and can be implemented for other types of feedforward neural networks.
Conv-64F contains four convolutional blocks where every block is constructed by one convolutional layer with 64 filters of size 3
Besides extracting feature embeddings from the last layer of the last convolutional block, the proposed Looking-Back also extracts the embeddings from the last layer of the second and third convolutional blocks. These three feature embeddings are then used in the graph-based label propagation, as illustrated in Figure 1. The dimensions of the feature embeddings extracted from the second, third, and fourth convolutional blocks are $64 \times 21 \times 21$, $64 \times 10 \times 10$, and $64 \times 5 \times 5$, respectively. Here, the number of channels, 64, is determined by the Conv-64F architecture, whereas the feature map heights and widths are a consequence of the input image dimensions given the Conv-64F architecture.
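The spatial sizes quoted above can be reproduced with a short calculation. The sketch below assumes settings that are common for Conv-64F but are not stated in this section: an 84 × 84 input image, stride-1 3 × 3 convolutions with padding 1, and non-overlapping 2 × 2 max-pooling. The function names are hypothetical.

```python
def block_output_size(size, kernel=3, padding=1, pool=2):
    """Spatial size after one stride-1 convolution followed by max-pooling."""
    conv = size + 2 * padding - kernel + 1  # stride-1 convolution
    return conv // pool                     # non-overlapping pooling (floor)

def conv64f_feature_shapes(input_size=84, n_blocks=4, channels=64):
    """Return (channels, height, width) after each convolutional block."""
    shapes, size = [], input_size
    for _ in range(n_blocks):
        size = block_output_size(size)
        shapes.append((channels, size, size))
    return shapes

shapes = conv64f_feature_shapes()
# Blocks 2, 3, and 4 are the ones used by Looking-Back.
looking_back_shapes = shapes[1:]
```

Under these assumptions, the second to fourth blocks yield feature maps with spatial sizes 21, 10, and 5, matching the dimensions given in the text.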
3.3 Graph Construction Module
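In the original work of TPN Liu2019, the authors proposed a pair-wise similarity function that used an example-wise length-scale parameter. Adopting this mechanism, for the output of the $l$-th convolutional block, we compute the similarity of two images ($x_i$, $x_j$) via
$$ W^l_{ij} = \exp\left( -\frac{1}{2} \, d\left( \frac{f^l(x_i)}{\sigma^l_i}, \frac{f^l(x_j)}{\sigma^l_j} \right) \right), \tag{1} $$
where $f^l(x_i)$ denotes the feature embedding of $x_i$ from the $l$-th convolutional block and $d(\cdot, \cdot)$ measures the distance between the two scaled feature embeddings. Here, the length-scale parameter $\sigma^l_i$ is computed by a relation network module. As illustrated in Figure 2, we use a separate relation network for the second, third, and fourth convolutional blocks, since the dimensions and information contents of the respective feature embeddings differ.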
The overall architecture of the relation network module, which computes $\sigma^l_i$ and $\sigma^l_j$, is similar to the architecture used in TPN Liu2019. Specifically, each relation network module consists of two convolutional blocks, followed by two fully-connected layers. Each convolutional block is composed of a $3 \times 3$ convolutional layer with a stride of 1, a batch normalization layer, ReLU activation, and a $2 \times 2$ max-pooling layer with a stride of 1.
In the Looking-Back model, we compute multiple symmetric normalized graph Laplacians chung1997spectral via
$$ S^l = (D^l)^{-1/2} \, W^l \, (D^l)^{-1/2}, \tag{2} $$
where $D^l$ is the diagonal matrix whose $(i, i)$-th diagonal element is the sum of the $i$-th row of the $l$-th similarity matrix $W^l$. Similar to TPN, we keep the $k$-max values in each row of $W^l$ when constructing multiple $k$-nearest neighbor graphs during episodic training to improve computational efficiency.
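The graph construction for a single convolutional block can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the distance is taken to be the squared Euclidean distance, the length-scale values `sigma` are passed in directly instead of being produced by a relation network, and the graph is symmetrized after keeping the k largest entries per row. All function names are hypothetical.

```python
import numpy as np

def similarity_matrix(features, sigma):
    """W_ij = exp(-0.5 * ||f_i / sigma_i - f_j / sigma_j||^2)."""
    scaled = features / sigma[:, None]
    sq_dist = ((scaled[:, None, :] - scaled[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * sq_dist)

def keep_k_max(w, k):
    """Keep the k largest entries in each row, then symmetrize the graph."""
    mask = np.zeros_like(w, dtype=bool)
    top = np.argsort(-w, axis=1)[:, :k]
    np.put_along_axis(mask, top, True, axis=1)
    mask = mask | mask.T
    return np.where(mask, w, 0.0)

def normalize(w, eps=1e-8):
    """S = D^{-1/2} W D^{-1/2}, with D the diagonal matrix of row sums of W."""
    d_inv_sqrt = 1.0 / np.sqrt(w.sum(axis=1) + eps)
    return d_inv_sqrt[:, None] * w * d_inv_sqrt[None, :]

# Toy episode: 6 flattened embeddings of dimension 4 (random, illustrative).
rng = np.random.default_rng(0)
features = rng.normal(size=(6, 4))
sigma = np.ones(6)
S = normalize(keep_k_max(similarity_matrix(features, sigma), k=3))
```

In Looking-Back, this construction would be repeated for the embeddings of each of the three convolutional blocks, giving one normalized matrix per block.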
3.4 Classification Loss
After constructing multiple nearest neighbor graphs as explained in Section 3.3, we use label propagation zhou2004learning, similar to TPN Liu2019, to compute the prediction (i.e., class-membership) scores for the query images.
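Let $Y$ be an initial score matrix. For a given image $x_i$ in the support set,
$$ Y_{ij} = \begin{cases} 1 & \text{if } y_i = j, \\ 0 & \text{otherwise.} \end{cases} \tag{3} $$
The label propagation process is an iterative process
$$ F^l_{t+1} = \alpha S^l F^l_t + (1 - \alpha) Y, \tag{4} $$
where $F^l_t$ is the predicted label at time step $t$. The predicted scores for an input image’s feature embedding from the $l$-th convolutional block are computed via
$$ F^{l*} = (I - \alpha S^l)^{-1} Y, \tag{5} $$
where $I$ is the identity matrix, $S^l$ is the normalized graph Laplacian of that feature embedding from the $l$-th convolutional block, and $\alpha$ is a hyperparameter controlling the propagation rate.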
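The relation between the iterative update and the closed-form solution of label propagation can be checked numerically. The sketch below uses a hand-made four-node graph and a propagation coefficient of 0.5 so that the iteration converges quickly; both choices, as well as the function names, are assumptions for illustration only. Note that the fixed point of the iteration equals the closed-form scores up to the constant factor one minus the propagation coefficient, which does not change the predicted class.

```python
import numpy as np

def propagate_closed_form(s, y, alpha):
    """Solve (I - alpha * S) F = Y for F."""
    n = s.shape[0]
    return np.linalg.solve(np.eye(n) - alpha * s, y)

def propagate_iterative(s, y, alpha, n_steps):
    """Iterate F <- alpha * S F + (1 - alpha) * Y starting from Y."""
    f = y.copy()
    for _ in range(n_steps):
        f = alpha * s @ f + (1.0 - alpha) * y
    return f

# Toy graph: nodes 0 and 2 are labeled support examples (classes 0 and 1),
# nodes 1 and 3 are unlabeled query examples.
w = np.array([[0.0, 1.0, 0.1, 0.0],
              [1.0, 0.0, 0.0, 0.1],
              [0.1, 0.0, 0.0, 1.0],
              [0.0, 0.1, 1.0, 0.0]])
d_inv_sqrt = 1.0 / np.sqrt(w.sum(axis=1))
s = d_inv_sqrt[:, None] * w * d_inv_sqrt[None, :]
y = np.array([[1.0, 0.0], [0.0, 0.0], [0.0, 1.0], [0.0, 0.0]])

alpha = 0.5
f_star = propagate_closed_form(s, y, alpha)
f_iter = propagate_iterative(s, y, alpha, n_steps=100)
predictions = f_star.argmax(axis=1)
```

Each query node inherits the label of the support node it is most strongly connected to, and all query nodes are scored in a single solve, which is the transductive aspect of the procedure.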
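After computing the prediction scores, we obtain class-membership probability scores for the feature embeddings from the $l$-th convolutional block by applying a softmax function as follows:
$$ P(\tilde{y}^l_i = j \mid x_i) = \frac{\exp\left(F^{l*}_{ij}\right)}{\sum_{j'=1}^{N} \exp\left(F^{l*}_{ij'}\right)}, \tag{6} $$
where $\tilde{y}^l_i$ is the predicted class label for the feature embedding of the $i$-th input image from the $l$-th convolutional block, and $F^{l*}_{ij}$ is the predicted score at the $(i, j)$-th position.
The total loss term is the combination of the cross-entropy losses for the different layers’ features:
$$ \mathcal{L} = \sum_{l=2}^{4} \lambda_l \, \mathcal{L}^l_{CE}, \tag{7} $$
where $\mathcal{L}^l_{CE}$ is the cross-entropy loss computed from the class-membership probabilities of the $l$-th convolutional block, and $\lambda_l$ is a relative weight for the cross-entropy loss term of the feature embeddings from the $l$-th convolutional block and is a hyperparameter during the episodic training.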
The feature embeddings from the second ($l = 2$) and third ($l = 3$) convolutional blocks containing lower-level information are only used during training to improve the feature extractor module (Section 3.2). In both the validation and test stage, the class labels are obtained from the prediction on feature embeddings of the last convolutional block only, that is, the fourth convolutional block, $l = 4$.
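The weighted combination of per-block cross-entropy losses can be sketched as follows. This is an illustrative NumPy version under simplifying assumptions: the propagated scores of each block are given as plain arrays, the loss is averaged over the labeled examples passed in, and the dictionary keys 2, 3, and 4 stand for the three convolutional blocks. The function names are hypothetical.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def cross_entropy(scores, labels):
    """Mean negative log-probability of the true class."""
    probs = softmax(scores)
    return -np.log(probs[np.arange(len(labels)), labels]).mean()

def total_loss(scores_per_block, labels, weights):
    """Weighted sum of the cross-entropy losses of all blocks."""
    return sum(weights[l] * cross_entropy(scores_per_block[l], labels)
               for l in scores_per_block)

# Toy 5-way example with uninformative (all-zero) scores for each block.
labels = np.array([0, 1, 2, 3, 4])
scores = {l: np.zeros((5, 5)) for l in (2, 3, 4)}
weights = {2: 1.0, 3: 1.0, 4: 1.0}  # equal weighting of the three terms
loss = total_loss(scores, labels, weights)

# At inference time, only the scores of the last block would be used.
predicted = scores[4].argmax(axis=1)
```

With uninformative scores, each block contributes the logarithm of the number of classes to the loss, which provides a simple sanity check for an implementation.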
In this section, we evaluate the proposed Looking-Back method on two popular FSL benchmark datasets, i.e., miniImageNet Ravi2017 and tieredImageNet ren2018meta, and compare with other state-of-the-art FSL methods.
miniImageNet. The miniImageNet dataset is widely used for comparing different few-shot learning methods Ravi2017. It is a small subset of ImageNet deng2009imagenet that consists of 100 classes with 600 examples per class. For our experiments, we split the dataset into 64 classes for training, 16 classes for validation, and 20 classes for testing following Ravi2017.
tieredImageNet. Similar to miniImageNet, the tieredImageNet dataset is a small, simplified version of ImageNet proposed by ren2018meta. Different from miniImageNet, tieredImageNet has a hierarchical or tiered structure consisting of 34 larger classes, where each larger class contains 10 to 30 smaller classes (i.e., related subcategories). tieredImageNet contains 608 smaller classes and 779,165 images in total. We split the dataset as described in ren2018meta, resulting in a training set consisting of 20 larger classes, a validation set consisting of 6 larger classes, and the test set consisting of 8 larger classes. The advantage of splitting the dataset based on the larger classes, as opposed to splitting into the subclasses, is that this approach creates a clearer distinction between training, test, and validation sets.
4.2 Implementation Details
As mentioned before, we adopted the Conv-64F architecture (Section 3.2) as the backbone for our model. During training, we used the three layers’ feature embeddings as shown in Figure 1 and Figure 2. For label propagation, we chose the same hyperparameters as described in Liu2019, setting $\alpha$ (the propagation coefficient, Eq. 4 and 5) to 0.99 and $k$ (the per-row max values of the graph Laplacians) to 20. Moreover, we gave equal weighting to the individual loss terms when computing the total loss (Eq. 7), that is, setting $\lambda_2$, $\lambda_3$, and $\lambda_4$ to 1.
During the episodic training, each episode was a 5-way $K$-shot task with -query images in each task, mimicking the testing scenario. We used the Adam optimizer kingma2014adam to train the model and set the initial learning rate to 0.001. For miniImageNet, the learning rate was decayed by a multiplicative factor of 0.8 every 5,000 episodes. The same multiplicative factor was used for decaying the learning rate when training on tieredImageNet, but it was decayed more frequently, every 2,000 epochs, due to the larger size and complexity of tieredImageNet.
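The learning rate schedule described above can be written as a simple step function. The sketch below assumes a step-wise (staircase) decay, which is the most direct reading of the text; the function name `learning_rate` and its parameter names are hypothetical.

```python
def learning_rate(step_index, initial_lr=0.001, factor=0.8, decay_every=5000):
    """Step decay: multiply the learning rate by `factor` every `decay_every` steps."""
    return initial_lr * factor ** (step_index // decay_every)

# miniImageNet setting from the text: decay by 0.8 every 5,000 episodes.
lr_start = learning_rate(0)
lr_after_first_decay = learning_rate(5000)
lr_after_second_decay = learning_rate(10000)

# For the tieredImageNet setting, the same function can be used with decay_every=2000.
lr_tiered = learning_rate(2000, decay_every=2000)
```

Keeping the schedule as a pure function of the step index makes it easy to log or plot the learning rate independently of the optimizer.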
To evaluate the model on the test set, we randomly sampled 600 $N$-way $K$-shot tasks from an independent test set with $K = 1$ and $K = 5$, respectively. In both scenarios, $K = 1$ and $K = 5$, there were query samples in each class (that is, query examples in total), which were used to compute the prediction accuracy for a given task or episode. To compute the overall prediction accuracy of a given model, we randomly sampled the test set 600 times and calculated the accuracy by averaging the prediction accuracy across these episodes.
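Since the results are reported as a mean accuracy with a 95% confidence interval over the test episodes, the following sketch shows one common way to compute such an interval. It assumes a normal approximation (1.96 times the standard error of the mean), which is a widespread convention in FSL evaluations but is not spelled out in the text; the function name and the toy accuracies are hypothetical.

```python
import math
import statistics

def mean_with_ci95(accuracies):
    """Return the mean and the half-width of a 95% confidence interval."""
    mean = statistics.fmean(accuracies)
    # Standard error of the mean, using the sample standard deviation.
    sem = statistics.stdev(accuracies) / math.sqrt(len(accuracies))
    return mean, 1.96 * sem

# Toy example: 600 per-episode accuracies alternating between 0.5 and 0.6.
episode_accuracies = [0.5, 0.6] * 300
mean_acc, half_width = mean_with_ci95(episode_accuracies)
report = f"{100 * mean_acc:.2f} ± {100 * half_width:.2f}"
```

The resulting string follows the "mean ± half-width" format used for reporting accuracies in percent.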
4.3 Results and Discussion
In this section, we compare our proposed Looking-Back method to other state-of-the-art FSL methods. All neural network implementations are based on a Conv-64F backbone architecture for feature extraction as described in Section 3.2. Following the established conventions, we consider both 5-way 1-shot and 5-way 5-shot settings for the performance comparisons, using the two common FSL benchmark datasets miniImageNet and tieredImageNet as described in Section 4.1. The accuracy is computed as the average of 600 test episodes (as described in Section 4.2) with a 95% confidence interval. As the results for miniImageNet (Table 1) and tieredImageNet (Table 2) indicate, our proposed Looking-Back method achieves state-of-the-art results on both datasets, in both the 5-way 1-shot and 5-way 5-shot scenarios.
Table 1: Accuracy (%) with 95% confidence intervals on miniImageNet.
| Model | Backbone | 5-way 1-shot | 5-way 5-shot |
|---|---|---|---|
| Matching Net vinyals2016matching | Conv-64 | 43.56 ± 0.84 | 55.31 ± 0.73 |
| Prototypical Net Fort2017 | Conv-64 | 49.42 ± 0.78 | 68.20 ± 0.66 |
| Relation Net sung2018learning | Conv-64 | 50.44 ± 0.82 | 65.32 ± 0.70 |
| Reptile nichol2018first | Conv-64 | 49.97 ± 0.32 | 65.99 ± 0.58 |
| GNN Garcia2018 | Conv-64 | 49.02 ± 0.98 | 63.50 ± 0.84 |
| MAML finn2017model | Conv-64 | 48.70 ± 1.84 | 63.11 ± 0.92 |
| TPN Liu2019 | Conv-64 | 53.75 ± 0.86 | 69.43 ± 0.67 |
| Looking-Back | Conv-64 | 55.91 ± 0.86 | 70.99 ± 0.68 |
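Table 2: Accuracy (%) with 95% confidence intervals on tieredImageNet.
| Model | Backbone | 5-way 1-shot | 5-way 5-shot |
|---|---|---|---|
| Prototypical Net Fort2017 | Conv-64 | 53.31 ± 0.89 | 72.69 ± 0.74 |
| Relation Net sung2018learning | Conv-64 | 54.48 ± 0.93 | 71.31 ± 0.78 |
| Reptile nichol2018first | Conv-64 | 52.36 ± 0.23 | 71.03 ± 0.22 |
| MAML finn2017model | Conv-64 | 51.67 ± 1.81 | 70.30 ± 1.75 |
| TPN Liu2019 | Conv-64 | 57.53 ± 0.96 | 72.85 ± 0.74 |
| Looking-Back | Conv-64 | 58.97 ± 0.97 | 73.59 ± 0.74 |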
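Table 3: Accuracy (%) with 95% confidence intervals for meta-training in the "Higher Shot" setting.
| Dataset | Model | 5-way 1-shot | 5-way 5-shot |
|---|---|---|---|
| miniImageNet | TPN | 55.51 ± 0.86 | 69.86 ± 0.65 |
| miniImageNet | Looking-Back | 56.49 ± 0.83 | 70.47 ± 0.66 |
| tieredImageNet | TPN | 59.91 ± 0.94 | 73.30 ± 0.75 |
| tieredImageNet | Looking-Back | 61.19 ± 0.92 | 73.78 ± 0.74 |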
Comparing Looking-Back and TPN training in a "Higher Shot" setting. The performance comparisons between Looking-Back and TPN Liu2019 (Tables 1 and 2) provide supportive evidence that utilizing lower-level information, which is contained in previous layers’ feature embeddings and utilized by Looking-Back, improves the predictive performance by a substantial amount. In this section, we investigate whether the lower-level information can also enhance the performance in a "Higher Shot" setting.
In FSL, it is common to use support sets of similar size during meta-training and testing. However, some researchers found that using larger support sets during meta-training (i.e., increasing the number of "shots") can improve the predictive performance of FSL systems based on evaluation on the same (i.e., smaller shot) test sets snell2017prototypical; li2019revisiting. Similar observations have been made in the original TPN paper Liu2019, where the authors described that increasing the number of examples in the support sets during meta-training (referred to as "Higher Shot") can improve the predictive accuracy during testing. However, using a larger number of shots during meta-training than testing does not always improve the predictive performance, and it is still an open area of research cao2019theoretical.
Although "Higher Shot" training is not the focus of this paper, we conducted experiments with higher shots and report the results in Table 3, adopting the procedure described in the original TPN paper Liu2019 to enable fair comparisons. The results in Table 3 indicate that Looking-Back utilizing lower-level information outperforms TPN in a "Higher Shot" setting as well.
Table 4 summarizes the performance gain of Looking-Back over TPN for the regular meta-training scenario (same number of shots in the training and test tasks, Tables 1 and 2) and meta-training with higher shots (Table 3). From Table 4, we can observe that on both datasets, and for both same-shot and higher-shot meta-training, the improvement in 1-shot settings is larger than in 5-shot settings. We argue that when more support images are available (higher shot), the role of utilizing lower-level information becomes less important. The main rationale behind using previous layers’ feature embeddings is to use additional lower-level information when information from the final layer’s feature embedding is scarce. Intuitively, the benefit of using lower-level information diminishes if a meta-learner can utilize a larger number of examples in the support set.
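The gains of Looking-Back over TPN discussed above can be recomputed directly from the mean accuracies reported in Tables 1 to 3. The snippet below is a small arithmetic sketch; the dictionary layout and the key names are assumptions made for this example, and confidence intervals are ignored.

```python
# Mean accuracies (%) copied from Tables 1 to 3 as (TPN, Looking-Back).
results = {
    ("miniImageNet", "same shot", "1-shot"): (53.75, 55.91),
    ("miniImageNet", "same shot", "5-shot"): (69.43, 70.99),
    ("miniImageNet", "higher shot", "1-shot"): (55.51, 56.49),
    ("miniImageNet", "higher shot", "5-shot"): (69.86, 70.47),
    ("tieredImageNet", "same shot", "1-shot"): (57.53, 58.97),
    ("tieredImageNet", "same shot", "5-shot"): (72.85, 73.59),
    ("tieredImageNet", "higher shot", "1-shot"): (59.91, 61.19),
    ("tieredImageNet", "higher shot", "5-shot"): (73.30, 73.78),
}

# Absolute gain of Looking-Back over TPN in percentage points.
gains = {key: round(lb - tpn, 2) for key, (tpn, lb) in results.items()}

for (dataset, regime, setting), gain in gains.items():
    print(f"{dataset:15s} {regime:12s} {setting}: +{gain:.2f}")
```

In every dataset and training regime, the 1-shot gain exceeds the corresponding 5-shot gain, which is consistent with the argument that lower-level information matters most when labeled data is scarcest.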
Influence of higher shot training on Looking-Back. As indicated by the results in Table 4 and hypothesized in the previous section, our Looking-Back method could be more useful when the data is more scarce. This is likely because the more information is available during training (i.e., the support sets consist of additional examples in higher-shot settings), the more negligible the information from earlier layers becomes as supportive information.
In a 1-shot setting, we were still able to observe that the lower-level information used by Looking-Back models benefits the model performance when training in the higher shots setting, as summarized in Table 5. However, in the presence of a larger number of images, using lower-level information during training results in more limited improvements (5-shot test setting on tieredImageNet) or may have a small detrimental impact (5-shot test settings on miniImageNet) as shown in Table 5. This finding provides further evidence that the lower-level information has a more beneficial effect when the data is more scarce.
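Table 6: Accuracy (%) with 95% confidence intervals of Looking-Back when using different layers for the class label prediction.
| Dataset | Setting | 2nd layer | 3rd layer | 4th layer |
|---|---|---|---|---|
| miniImageNet | 1-shot | 42.24 ± 0.76 | 50.87 ± 0.81 | 55.91 ± 0.86 |
| miniImageNet | 5-shot | 58.10 ± 0.72 | 67.07 ± 0.69 | 70.99 ± 0.68 |
| tieredImageNet | 1-shot | 46.25 ± 0.87 | 54.70 ± 0.93 | 58.97 ± 0.97 |
| tieredImageNet | 5-shot | 61.12 ± 0.75 | 69.94 ± 0.74 | 73.59 ± 0.74 |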
Why only using the last layer’s information during inference. Both DN4 li2019revisiting and the dense classification network lifchitz2019dense use the entire expanded feature embeddings of the last layer during training as well as inference. One of the main reasons we only use the feature embeddings of the last layer during inference is that the lower-level information from previous layers is used to augment the graph construction during training but does not have equal relevance for the prediction task during inference. In contrast to Looking-Back, in both DN4 and the dense classification network, the additional information of the expanded feature embeddings is on the same footing.
To test our hypothesis that the feature embeddings of the last layer bear the most relevance for the prediction task, we compared the prediction accuracy of Looking-Back when using different layers for the class label prediction. As indicated by the results in Table 6, the prediction accuracy of the 4th (last) layer is higher than the prediction accuracy of the 3rd layer, and the accuracy of the 3rd layer is higher than the accuracy of the 2nd layer, supporting the hypothesis that the last layer contains the most useful information.
In this paper, we propose a new approach to FSL, capturing additional information inside the feature extracting network to improve prediction performance. In particular, the proposed Looking-Back method employs a graphical structure to utilize the lower-level information from previous layers’ feature embeddings, which differs from existing methods that only focus on expansions of the last layer’s feature embeddings. Experiments on two popular FSL datasets provide evidence for the benefits of using lower-level information in FSL.
Conceptualization, Z.Y., S.R.; investigation and validation, S.R., Z.Y.; data curation, Z.Y.; writing–original draft preparation, Z.Y., S.R.; writing–review and editing, Z.Y., S.R.; visualization, Z.Y., S.R.; supervision, S.R.; project administration, S.R.; funding acquisition, S.R. All authors have read and agreed to the published version of the manuscript. Support for this review article was provided by the Office of the Vice Chancellor for Research and Graduate Education at the University of Wisconsin-Madison with funding from the Wisconsin Alumni Research Foundation.
The authors declare no conflict of interest.