Information Symmetry Matters: A Modal-Alternating Propagation Network for Few-Shot Learning

09/03/2021 ∙ by Zhong Ji, et al. ∙ Aberystwyth University, Tianjin University

Semantic information provides intra-class consistency and inter-class discriminability beyond visual concepts, and has been employed in Few-Shot Learning (FSL) to achieve further gains. However, semantic information is available only for labeled samples and absent for unlabeled samples, so the embeddings are rectified unilaterally: only the few labeled samples are guided by semantics. This inevitably introduces a cross-modal bias between semantic-guided samples and nonsemantic-guided samples, which results in an information asymmetry problem. To address this problem, we propose a Modal-Alternating Propagation Network (MAP-Net) that supplements the absent semantic information of unlabeled samples and thereby builds information symmetry among all samples in both visual and semantic modalities. Specifically, MAP-Net transfers neighbor information by graph propagation to generate pseudo-semantics for unlabeled samples, guided by the completed visual relationships, and to rectify the feature embeddings. In addition, because of the large discrepancy between visual and semantic modalities, we design a Relation Guidance (RG) strategy that guides the visual relation vectors via semantics so that the propagated information is more beneficial. Extensive experimental results on three semantic-labeled datasets, i.e., Caltech-UCSD-Birds 200-2011, SUN Attribute Database, and Oxford 102 Flower, demonstrate that the proposed method achieves promising performance and outperforms state-of-the-art approaches, which indicates the necessity of information symmetry.




I Introduction

Deep learning has become the dominant technology for computer vision tasks. However, its performance is severely limited by the amount of labeled data, which can be difficult or even infeasible to acquire because of high annotation costs or the scarcity of rare categories. By contrast, humans can recognize a new object from only one or a few observations, drawing on abundant prior knowledge. Inspired by this, learning to recognize new classes from few samples, called Few-Shot Learning (FSL) [3, 39], has attracted great attention recently.

Fig. 1: An illustration of information asymmetries between support set and query set in FSL. (a) Embeddings with only visual features; (b) Support embeddings fused with visual features and semantic vectors and query embeddings with only visual features; (c) Query embeddings fused with visual features and propagated pseudo-semantic vectors.

One intuitive solution for FSL is to employ the experience learned from other similar tasks. Meta-learning [11, 51], also known as “learning to learn”, aims at learning new concepts or skills rapidly with a few training examples based on abundant prior knowledge learned from base classes. Thus, the meta-learning framework has been widely employed on FSL and achieved promising performance [34, 31, 32].

Many studies demonstrate that humans capture object concepts not only from the visual view but also from the language describing the characteristics of the objects [13, 36]. Thus, some FSL approaches [41, 5, 19, 12] utilize auxiliary semantic information, i.e., word embeddings or attribute annotations, to enhance the feature representations and improve the performance. For example, Li [19] modified the visual embeddings according to the relationship between the visual distance and the semantic similarity of different categories. Huang [12] employed an attribute-guided attention mechanism to augment the representations with the guidance of semantics. However, since the semantic information of query samples is unavailable, combining purely visual query embeddings with semantics-enhanced support embeddings inevitably produces a cross-modal embedding bias, which leads to an information asymmetry problem, as shown in Figure 1(b).

To address this problem, we introduce a graph propagation model to obtain the query semantics from the completed relation information in the visual modality. The graph structure [38] has natural advantages in modeling relationships among nodes, which makes it effective for propagating information from one node to another. The main idea of our method is to update the visual graph with the guidance of semantics, and then propagate the semantic graph with the information transferred from the visual modality. With the alternating propagation in the two modalities, the information asymmetry problem is alleviated significantly, as the visual graph is rectified and the semantic graph is completed.

Considering the large discrepancy across modalities, it is essential to reduce the cross-modal shift so that the information in the two modalities is beneficial for each other. A common approach is to constrain the embeddings of the two modalities in a shared latent space with a penalty function. For example, Schonfeld [29] learned shared cross-modal feature embeddings of visual and semantic modalities with a Variational Auto-Encoder (VAE) and aligned the embeddings with two elaborate loss functions. Tokmakov [33] designed a soft constraint regularization which improves the robustness of the alignments. However, directly constraining the instance embeddings is inappropriate in FSL, since it is difficult to maintain a balance between extracting discriminative features and aligning cross-modal embeddings in a low-data scenario. To this end, we focus on the relations among samples and design a new guidance strategy that reduces the cross-modal discrepancy flexibly.

Specifically, we propose a Modal-Alternating Propagation Network (MAP-Net) to obtain semantic information of query samples and rectify the feature embeddings, which alleviates the information asymmetry problem in FSL. The MAP-Net constructs two graphs in two modalities, i.e., the visual graph and the semantic graph, which are propagated alternately, each with the guidance of the other modality. The semantic graph is incomplete since semantic embeddings of query samples are unavailable. Therefore, we transfer the relation information from the visual modality to the semantic modality, which is essential to propagate and complete the semantic graph. After the propagation, the information asymmetries are mitigated significantly, as the query semantics are generated. To reduce the cross-modal discrepancy, we propose a Relation Guidance (RG) strategy to modify the relationships in the visual modality. We transfer the relation vectors with a relation transfer module, which is trained with support-support pairs, to obtain the rectified relationships.

Our contributions are threefold:

  • We propose a Modal-Alternating Propagation Network to propagate modal information alternately to generate the pseudo-semantics of query samples and rectify the feature embeddings, which is effective in alleviating the information asymmetry problem between support and query samples.

  • To overcome the discrepancy between modalities and obtain the accurate relation information among different samples, we propose a Relation Guidance strategy to guide visual relationships with the relationships in semantic modality. The visual relation vectors are transferred with a relation transfer module trained with support-support pairs to represent the relationships more accurately.

  • We conduct experiments on three benchmark datasets with attributes or text descriptions, i.e., Caltech-UCSD-Birds 200-2011, SUN Attribute Database and Oxford 102 Flower, to compare the proposed method with previous few-shot learning methods. The experimental results demonstrate that our method achieves promising performance for few-shot learning from the perspective of information symmetry.

Fig. 2: The framework of MAP-Net for the 4-way 1-shot task. Semantic information of support set is given, while it is unknown in query set. The pseudo-semantic embeddings of query samples are generated by semantic graph in the MAP-Module, and the output embeddings in two modalities are fused for classification.

II Related Work

II-A Few-Shot Learning

Few-Shot Learning aims at learning novel concepts with only one or few objects, which has been widely studied in recent years. Most of the existing few-shot methods follow the meta-learning strategy, which is also known as learning to learn, to transfer prior knowledge obtained from a large amount of auxiliary data in the meta-training phase to the novel tasks. The meta-learning-based methods generally can be divided into three types: optimization-based methods, metric-based methods and data-augmentation-based methods.

The optimization-based methods learn parameters that are sub-optimal for any single task but serve as an initialization that can be quickly adapted to novel tasks with only a few steps of gradient descent. MAML [6] is the first optimization-based method that utilizes a second-order optimizing strategy within the meta-learning framework to quickly update the parameters. To simplify the optimization, Nichol [23] utilized a first-order method to replace the second-order derivatives in MAML. Rusu [28] updated parameters in a low-dimensional latent space, which is more practical in low-data scenarios. Since the shared initialization may lead to conflicts among tasks, Baik [1] proposed a task-and-layer-wise attenuation to forget the prior information selectively.

In the metric-based methods, the embedding space is constructed to measure the similarities among different feature embeddings. The simplicity and efficiency make metric-based methods highly attractive in the field of few-shot learning. Matching Networks [34] utilize an attention mechanism based on LSTM to learn to classify the novel samples. Snell [31] proposed Prototypical Networks to measure the distance between each sample and the prototypes of the corresponding class. Relation Network [32] utilizes a learnable metric instead of a manually defined one, which is more flexible in metric-based classification.

The main idea of the data-augmentation-based methods is to alleviate the lack of labeled data with data augmentation. Wang [40] generated samples with the idea of GAN to expand the diversity of data. Zhang [47] utilized a saliency detection method to fuse foreground and background in different images to augment samples. In order to alleviate the mode collapse problem in GAN, Li [20] employed the cWGAN in few-shot learning to ensure the diversity of generated data.

II-B Learning with Semantic Information

Semantic information usually plays a crucial role in various tasks, such as image-text matching [37, 2], emotion recognition [18, 49] and zero-shot learning [7, 15]. When the samples in the visual modality are scarce, semantic information is a good choice to assist the model in training. Many zero-shot learning methods align visual and semantic representations to classify novel classes without labeled samples. For example, Frome [7] presented a deep visual-semantic embedding model to learn the semantic relationships among classes from the semantic data, and to map the visual samples into a semantic space to be classified. Ji [15] employed a semantic embedding space to transfer knowledge from the seen domain to the unseen domain with an attribute-guided network to address cross-modal zero-shot hashing retrieval tasks. Considering the potential bias between seen and unseen classes, Gao [8] utilized a joint generative model to generate high-quality unseen features, which is further augmented with a self-training strategy. Guan [10] learned a robust cross-modal projection by synthesizing the unseen class data and designing a novel projection learning model to best utilize the synthesized data.

Based on the success of zero-shot learning, some methods utilizing auxiliary semantic information are proposed to boost the few-shot learning in recent years. Xing [41] integrated the semantic embeddings into visual features with an adaptive convex combination to assist the classification. Similarly, SAP-Net [14] refines the embeddings with the guidance of semantic information. Schwartz [30] further improved the few-shot learning with multiple semantics. For the discrepancy between visual and semantic modalities, Tokmakov [33] designed a semantic-based soft constraint regularization to learn the compositional representation of each sample. Chen [5] synthesized sample features in a semantic space with an encoder-decoder framework to increase the diversity of feature embeddings. Huang [12] employed the attention mechanism to emphasize or suppress the representation with the guidance of semantic information. Zhang [46] addressed few-shot classification in both relative and absolute views, which utilizes the semantic information and class labels simultaneously to represent the similarities among different samples and the absolute concept of instances.

II-C Propagation with Graph Model

Graph models are effective in modeling the relationships among different nodes and have received great attention in recent years. Thanks to the great expressive power of graphs, especially the convincing performance of the deep-learning-based Graph Neural Network (GNN) [9], graph-based methods have been employed in various types of tasks, e.g., node classification [16], link prediction [48], and clustering [45].

In this context, various graph-based methods have been proposed for few-shot learning to extract the relationships and rectify the graph in an episode. To further explore the potential of graph models in few-shot learning, Kim [16] proposed EGNN to dynamically update both the nodes and edges, and Yang [42] constructed both distribution-level and instance-level relations with DPGN. Moreover, Label Propagation (LP) [50] is a classical method to transfer knowledge from the neighbors of each node, which has proved effective for few-shot classification within the meta-learning framework. TPN [22] is the first method utilizing label propagation in few-shot learning. After this, Rodriguez [27] employed an Embedding Propagation method to yield a smoother embedding manifold. Similarly, in zero-shot learning, Liu [21] optimized the semantic space with an Attribute Propagation Network to refine the attributes of each class. Inspired by these works, our method, called Modal-Alternating Propagation Network (MAP-Net), propagates information in both the visual and semantic spaces with cross-modal assistance.

III Methodology

III-A Preliminary

We follow the episodic training paradigm of [34] for few-shot learning. In general, our model is trained on $N$-way $K$-shot settings. Each episode consists of $N$ categories from the meta-training set and is divided into two parts: $K$ labeled samples per category in the support set $\mathcal{S}$ and several query samples in the query set $\mathcal{Q}$. Since the support samples are labeled, their semantic information is available. The support set contains $N \times K$ labeled samples in total, together with their semantic vectors. It is denoted as $\mathcal{S} = \{(x_i, s_i, y_i)\}_{i=1}^{N \times K}$, where $(x_i, s_i, y_i)$ represents the $i$-th image, its semantic vector and the corresponding label. The query set $\mathcal{Q}$ contains samples with no semantic vectors. The episodic paradigm aims at obtaining the optimal performance on the query set by training the model with the support set.
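The episode construction described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' data pipeline: the function name `sample_episode`, the dictionary layout of the toy data, and the choice to sample queries per class are all assumptions made for the example.

```python
import random

def sample_episode(class_to_items, n_way, k_shot, n_query, rng):
    """Sample one N-way K-shot episode.

    class_to_items maps a class label to a list of (image_id, semantic_vector)
    pairs. Support items keep their semantic vector; query items drop it.
    """
    classes = rng.sample(sorted(class_to_items), n_way)
    support, query = [], []
    for label in classes:
        items = rng.sample(class_to_items[label], k_shot + n_query)
        for image_id, semantic in items[:k_shot]:
            support.append((image_id, semantic, label))
        for image_id, _ in items[k_shot:]:
            # Query samples carry no semantics; the label is kept only
            # so that a loss or an accuracy can be computed afterwards.
            query.append((image_id, label))
    return support, query

# Hypothetical toy meta-training set: 6 classes, 8 images each,
# 3-dimensional attribute vectors.
toy = {c: [(f"img_{c}_{i}", [float(c), 0.0, 1.0]) for i in range(8)]
       for c in range(6)}
support, query = sample_episode(toy, n_way=5, k_shot=1, n_query=3,
                                rng=random.Random(0))
```

With these toy settings the support set holds 5 × 1 triples and the query set holds 5 × 3 pairs without semantic vectors, which mirrors the asymmetry the method sets out to remove.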

III-B Overview

In this work, we propose a Modal-Alternating Propagation Network (MAP-Net) for few-shot learning to rectify the feature embeddings by constructing information symmetry. Figure 2 presents the main framework of MAP-Net, which consists of a modal-alternating propagation module (MAP-Module) and a feature fusion module. In the MAP-Module, two graphs, one in the visual modality and one in the semantic modality, are constructed and propagated alternately to obtain the semantic vectors of query samples. Concretely, we first modify the visual embeddings with the guidance of semantics to reduce the high intra-class variance in the visual modality. Then the semantic graph is completed by applying the information transferred from the visual space. We employ a Relation Guidance (RG) strategy to guide the visual embeddings. Rather than simply aligning the embeddings of the two modalities with a penalty function, we guide the relation information among samples in the visual modality with that in the semantic modality. With the guidance of the relation map, both the visual graph and the semantic graph are updated to obtain symmetrical embeddings of support and query samples. Finally, the propagated visual and semantic embeddings are fused into augmented embeddings by a convex combination as in [41], and the augmented embeddings are used for classification.

III-C Modal-Alternating Propagation

Since the categories of query samples are unknown during the test stage, their semantic information is unavailable. This may lead to an information asymmetry between support samples and query samples. To reduce the bias caused by the lack of query semantics, we design a Modal-Alternating Propagation Module (MAP-Module) that generates pseudo-semantic embeddings by updating the graphs through propagation.

Graph Construction. There are two types of graphs in the MAP-Module, the visual graph and the semantic graph. They transfer and propagate information to generate the pseudo query semantics. In the visual graph $G^v = (V^v, A^v)$, each node is the feature embedding obtained from a feature encoder (CNN), which can be denoted as

$$v_i = f_\theta(x_i),$$

where $f_\theta$ represents the feature encoder and $x_i$ is the $i$-th image. For the semantic graph $G^s = (V^s, A^s)$, we utilize the attributes or text descriptions of samples as semantic information. Each node is also an embedding of semantic information encoded by a semantic encoder. The difference is that the query semantic embeddings are initialized with zero vectors. The node $\hat{s}_i$ is defined as:

$$\hat{s}_i = \begin{cases} g_\phi(s_i), & \text{if } x_i \text{ is a support sample}, \\ \mathbf{0}, & \text{if } x_i \text{ is a query sample}, \end{cases}$$

where $s_i$ is the semantic vector of $x_i$ and $g_\phi$ is a multi-layer perceptron (MLP) used to encode the semantic vectors. Then the adjacency matrix $A^v$ in $G^v$ is obtained from the similarities among nodes. In this paper, we employ the Gaussian similarity function $A^v_{ij} = \exp(-d_{ij}^2 / \sigma^2)$ to calculate graph edges, where $d_{ij}$ is the Euclidean distance between two neighbor nodes and $\sigma$ is a scaling factor. In particular, we set the diagonal values $A^v_{ii} = 0$ to avoid self-reinforcement. We utilize the standard deviation of the distance matrix as $\sigma$, as in [27].
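The adjacency construction just described can be sketched as follows. The sketch assumes the Gaussian form exp(-d² / σ²) and takes the standard deviation over all entries of the distance matrix, diagonal included; both details are assumptions, and the function name `gaussian_adjacency` is illustrative only.

```python
import math

def gaussian_adjacency(nodes):
    """Build A_ij = exp(-d_ij^2 / sigma^2) with a zero diagonal."""
    n = len(nodes)
    dist = [[math.dist(nodes[i], nodes[j]) for j in range(n)] for i in range(n)]
    # Scaling factor: standard deviation of all entries of the distance
    # matrix (assumed here to include the zero diagonal).
    flat = [d for row in dist for d in row]
    mean = sum(flat) / len(flat)
    sigma = math.sqrt(sum((d - mean) ** 2 for d in flat) / len(flat))
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:  # a zero diagonal avoids self-reinforcement
                adj[i][j] = math.exp(-dist[i][j] ** 2 / sigma ** 2)
    return adj

# Three toy 2-D embeddings: the first two are close, the third is far away.
nodes = [[0.0, 0.0], [0.0, 1.0], [3.0, 0.0]]
adj = gaussian_adjacency(nodes)
```

Closer nodes receive larger edge weights, so `adj[0][1]` exceeds `adj[0][2]` in this toy case. The sketch does not guard against a zero standard deviation, which would occur if all pairwise distances were identical.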

The adjacency matrix of the semantic graph is defined in the same way as that of the visual graph. However, because the query semantic vectors are unavailable, they cannot provide the relationships among the nodes in the semantic graph. To address this problem, we transfer the relation information from the visual graph to the semantic graph to acquire a completed adjacency matrix, which can be applied to propagate information and generate the pseudo-semantic embeddings of query samples.

Fig. 3: Illustration of Relation Guidance (RG) strategy. There are two stages in RG strategy. Firstly, we train the Relation Transfer module with support-support pairs in visual and semantic modalities. Then, the entire visual relation map is rectified with the trained Relation Transfer module.

Relation Guidance. A straightforward way to pass relation information across modalities is to regard the visual adjacency matrix as the adjacency matrix of the semantic graph. However, the visual feature embeddings struggle to represent the corresponding samples correctly because of unrelated information in the images (e.g., background), while the semantic vectors are easier to discriminate. Therefore, the visual adjacency matrix is inappropriate for describing the relationships among samples, especially when the samples in each episode are so scarce that the estimated distribution is unreliable. To obtain an appropriate adjacency matrix, we propose a Relation Guidance (RG) strategy to modify the relationships among visual samples with the guidance of the support semantic vectors.

Specifically, we first obtain the relation maps of the two modalities, one visual and one semantic. Each position of a relation map is a relation vector representing the difference between two samples, which is calculated as follows:


The relation map can be denoted by four parts as follows:


where the relation vectors of the query-support and query-query pairs in the semantic relation map are 0. Thus, we design RG to employ the corresponding known parts to infer the unknown parts. Its core idea is to use the relations of support-support pairs in the semantic graph as guidance to transfer the visual relation map into a rectified relation map, which represents the relationships among different samples more accurately. Concretely, we set the support-support relation vectors of the visual modality as training samples and those of the semantic modality as the corresponding labels to train the relation transfer module with the mean square error (MSE) as loss function:


where . Then all relation vectors in will be fed into RG-module to obtain the modified :


Then the rectified adjacency matrix can be acquired as follows:


where represents the distance between and , which is the -norm of the rectified relation vector.
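The two stages of Relation Guidance can be illustrated with a toy sketch. Everything concrete here is an assumption made for illustration: the relation vector is taken to be the element-wise absolute difference of two embeddings, the relation transfer module is reduced to a single linear map trained by plain gradient descent, and all names and numbers are hypothetical. Only the overall procedure follows the text: train on support-support pairs with an MSE loss, then rectify every visual relation vector.

```python
def relation_vector(a, b):
    # Assumption: element-wise absolute difference between two embeddings.
    return [abs(x - y) for x, y in zip(a, b)]

def apply_linear(weights, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]

def mse(weights, inputs, targets):
    total = 0.0
    for r, t in zip(inputs, targets):
        total += sum((p - y) ** 2 for p, y in zip(apply_linear(weights, r), t))
    return total / len(inputs)

def train_relation_transfer(inputs, targets, steps=200, lr=0.1):
    """Fit a linear map from visual to semantic relation vectors with MSE."""
    out_dim, in_dim = len(targets[0]), len(inputs[0])
    weights = [[0.0] * in_dim for _ in range(out_dim)]
    for _ in range(steps):
        grad = [[0.0] * in_dim for _ in range(out_dim)]
        for r, t in zip(inputs, targets):
            err = [p - y for p, y in zip(apply_linear(weights, r), t)]
            for o in range(out_dim):
                for i in range(in_dim):
                    grad[o][i] += 2.0 * err[o] * r[i] / len(inputs)
        weights = [[w - lr * g for w, g in zip(w_row, g_row)]
                   for w_row, g_row in zip(weights, grad)]
    return weights

# Toy episode: 3 support samples (with semantics) and 1 query sample.
support_visual = [[0.0, 0.1, 0.9], [0.2, 0.0, 1.0], [1.0, 0.8, 0.1]]
support_semantic = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
query_visual = [[0.9, 0.9, 0.0]]

# Stage 1: train on support-support pairs only.
pairs = [(i, j) for i in range(3) for j in range(3) if i != j]
train_x = [relation_vector(support_visual[i], support_visual[j]) for i, j in pairs]
train_y = [relation_vector(support_semantic[i], support_semantic[j]) for i, j in pairs]
initial_loss = mse([[0.0] * 3 for _ in range(2)], train_x, train_y)
weights = train_relation_transfer(train_x, train_y)
final_loss = mse(weights, train_x, train_y)

# Stage 2: rectify every visual relation vector, including those that
# involve the query sample, for which no semantic relation is known.
all_visual = support_visual + query_visual
rectified = {(i, j): apply_linear(weights, relation_vector(all_visual[i], all_visual[j]))
             for i in range(4) for j in range(4) if i != j}
```

The point of the design is that the transfer module only ever sees pairs with known semantic relations during training, yet it can be applied to query-support and query-query pairs afterwards, which is what allows the incomplete semantic graph to be filled in.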

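Graph Propagation. With the rectified adjacency matrix $\tilde{A}$, both the visual graph and the semantic graph can be updated with graph propagation. Concretely, the matrix is first symmetrically normalized as

$$L = D^{-1/2} \tilde{A} D^{-1/2},$$

where $D$ is the degree matrix of the graph, i.e., the diagonal matrix with $D_{ii} = \sum_j \tilde{A}_{ij}$. Then, we follow the label propagation operation as in [22] and get the propagation matrix as

$$P = (I - \alpha L)^{-1},$$

where $\alpha$ is a smoothing factor and $I$ is the identity matrix. Finally, the visual embeddings can be rectified as

$$\tilde{V} = P V,$$

and the completed semantic embeddings can be obtained as

$$\tilde{S} = P S,$$

where the rows of $V$ and $S$ are the visual and semantic node embeddings, respectively.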

With guidance of the pseudo query semantic embeddings and the existing support semantic embeddings, the feature embeddings are modified to be more discriminative. Most importantly, the information asymmetries between support set and query set are reduced significantly.
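A small numerical sketch may help show how propagation fills in a missing query semantic vector. It assumes the standard closed form used in label propagation, P = (I − αL)⁻¹ with L the symmetrically normalized adjacency, and uses a hand-written Gauss-Jordan inverse to stay dependency-free. The toy adjacency and semantic values are invented for illustration.

```python
import math

def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(matrix)
    aug = [list(row) + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [x / scale for x in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def propagate(adj, embeddings, alpha=0.2):
    """Return P @ embeddings with P = (I - alpha * D^{-1/2} A D^{-1/2})^{-1}."""
    n = len(adj)
    deg = [sum(row) for row in adj]  # assumes every node has a neighbor
    norm = [[adj[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]
    system = [[float(i == j) - alpha * norm[i][j] for j in range(n)]
              for i in range(n)]
    p = invert(system)
    dim = len(embeddings[0])
    return [[sum(p[i][k] * embeddings[k][c] for k in range(n))
             for c in range(dim)] for i in range(n)]

# Nodes 0 and 1 are support samples of two classes; node 2 is a query
# sample whose semantic embedding starts as a zero vector.
adjacency = [[0.0, 0.1, 0.9],
             [0.1, 0.0, 0.2],
             [0.9, 0.2, 0.0]]
semantics = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
completed = propagate(adjacency, semantics)
```

Because the query node is strongly connected to node 0, its propagated pseudo-semantic vector `completed[2]` leans toward the semantics of node 0 rather than node 1, even though it started from zeros.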

III-D Feature Fusion and Classification

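After the modal-alternating propagation module, the information asymmetries are reduced significantly, since both visual feature embeddings and semantic embeddings are available for every sample. Because the information in the two modalities is complementary, we fuse the embeddings with a convex combination to obtain more discriminative embeddings:

$$z_i = \lambda_i \tilde{v}_i + (1 - \lambda_i)\, \tilde{s}_i,$$

where $\tilde{v}_i$ and $\tilde{s}_i$ are the propagated visual and semantic embeddings of the $i$-th sample and $\lambda_i$ is a coefficient learned with a weight learner:

$$\lambda_i = h(\tilde{v}_i \oplus \tilde{s}_i),$$

where $h$ is an MLP and $\oplus$ represents the concatenating operation. Then, following the operation of Prototypical Networks [31], we calculate the classification loss as follows:

$$\mathcal{L}_{cls} = -\log p(y = n \mid x_q), \qquad p(y = n \mid x_q) = \frac{\exp(-d(z_q, c_n))}{\sum_{n'} \exp(-d(z_q, c_{n'}))},$$

where $d(\cdot, \cdot)$ is the Euclidean distance, $p(y = n \mid x_q)$ is the probability that query sample $x_q$ belongs to class $n$ and $c_n$ is the prototype of augmented features in class $n$.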


Therefore, the total loss is

$$\mathcal{L} = \mathcal{L}_{cls} + \beta\, \mathcal{L}_{rt},$$

where $\mathcal{L}_{rt}$ is the relation transfer loss and $\beta$ is its weight coefficient.
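The fusion and classification steps can be sketched as below. This is a simplified illustration: the fusion coefficient is passed in as a fixed number rather than produced by a learned MLP, the distance is the plain Euclidean distance named in the text, and the function names and toy embeddings are hypothetical.

```python
import math

def fuse(visual, semantic, lam):
    """Convex combination of a visual and a semantic embedding."""
    return [lam * v + (1.0 - lam) * s for v, s in zip(visual, semantic)]

def prototypes(embeddings, labels):
    """Class prototype = mean of the fused support embeddings of that class."""
    grouped = {}
    for emb, label in zip(embeddings, labels):
        grouped.setdefault(label, []).append(emb)
    return {label: [sum(col) / len(embs) for col in zip(*embs)]
            for label, embs in grouped.items()}

def class_probabilities(query, protos):
    """Softmax over negative Euclidean distances to the prototypes."""
    logits = {label: -math.dist(query, proto) for label, proto in protos.items()}
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {label: math.exp(v - peak) for label, v in logits.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}

def classification_loss(query, true_label, protos):
    """Negative log-probability of the true class."""
    return -math.log(class_probabilities(query, protos)[true_label])

# Toy example with an assumed fixed coefficient of 0.5.
fused_example = fuse([1.0, 1.0], [3.0, 5.0], lam=0.5)
protos = prototypes([[0.0, 0.0], [4.0, 0.0]], ["a", "b"])
probs = class_probabilities([1.0, 0.0], protos)
loss_a = classification_loss([1.0, 0.0], "a", protos)
loss_b = classification_loss([1.0, 0.0], "b", protos)
```

The query at `[1.0, 0.0]` lies closer to the prototype of class `"a"`, so it receives the higher probability and the lower loss for that class.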

| Methods | Semantic | Backbone | 5-way 1-shot | 5-way 5-shot |
| --- | --- | --- | --- | --- |
| MatchingNet [34] | N | ConvNet4 | 60.52 ± 0.88% | 75.29 ± 0.75% |
| ProtoNet [31] | N | ConvNet4 | 50.46 ± 0.88% | 76.39 ± 0.64% |
| RelationNet [32] | N | ConvNet4 | 62.34 ± 0.94% | 77.84 ± 0.68% |
| MAML [6] | N | ConvNet4 | 54.73 ± 0.97% | 75.75 ± 0.75% |
| ARML [43] | N | ConvNet4 | 62.33 ± 1.47% | 73.34 ± 0.70% |
| SoSN-ArL [46] | Y | ConvNet4 | 50.62% | 65.87% |
| AM3 [41] | Y | ConvNet4 | 73.78 ± 0.28% | 81.39 ± 0.26% |
| AGAM [12] | Y | ConvNet4 | 75.87 ± 0.29% | 81.66 ± 0.25% |
| MAP-Net (Ours) | Y | ConvNet4 | 80.92 ± 0.21% | 85.88 ± 0.17% |
| MatchingNet [34] | N | ResNet12 | 60.96 ± 0.35% | 77.31 ± 0.25% |
| ProtoNet [31] | N | ResNet12 | 68.8% | 76.4% |
| RelationNet [32] | N | ResNet12 | 60.21 ± 0.35% | 80.18 ± 0.25% |
| MAML [6] | N | ResNet18 | 69.96 ± 1.01% | 82.70 ± 0.65% |
| TADAM [25] | N | ResNet12 | 69.2% | 78.6% |
| FEAT [44] | N | ResNet12 | 68.87 ± 0.22% | 82.90 ± 0.15% |
| AFHN [20] | N | ResNet18 | 70.53 ± 1.01% | 83.95 ± 0.63% |
| Comp. [33] | Y | ResNet10 | 53.6% | 74.6% |
| AM3 [41] | Y | ResNet12 | 73.6% | 79.9% |
| Dual TriNet [5] | Y | ResNet12 | 69.61 ± 0.46% | 84.10 ± 0.35% |
| Multi-Sem. [30] | Y | DenseNet121 | 76.1% | 82.9% |
| AGAM [12] | Y | ResNet12 | 79.58 ± 0.23% | 87.17 ± 0.23% |
| MAP-Net (Ours) | Y | ResNet12 | 82.45 ± 0.23% | 88.30 ± 0.17% |

TABLE II: Few-shot classification accuracy on CUB with 95% confidence intervals. Some of the accuracies are those reported in [12].

IV Experiments

IV-A Experiment Setup

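| Datasets | Training set | Validation set | Testing set | Semantics |
| --- | --- | --- | --- | --- |
| CUB-200-2011 [35] | 100 | 50 | 50 | Attributes |
| SUN Attribute [26] | 580 | 65 | 72 | Attributes |
| Flowers 102 [24] | 60 | 20 | 22 | Text descriptions |

TABLE I: Information of datasets (number of classes in each split).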

Datasets. We conduct the experiments on three benchmark datasets with semantic information, i.e., attributes or text descriptions: Caltech-UCSD-Birds 200-2011 (CUB) [35], SUN Attribute Database (SUN) [26] and Oxford 102 Flower (Flower) [24]. The details of the datasets are shown in Table I. CUB is a fine-grained dataset of bird species that consists of 11,788 images of 200 categories, with 312 attributes for each class. We follow the split in [3], which selects 100, 50 and 50 classes for training, validation and testing, respectively. SUN contains 14,340 scene images of 717 classes with 102 attributes. Following [12], 580, 65 and 72 classes are employed for training, validation and testing. Flower is also a fine-grained dataset with 102 classes of flower species, where the number of images in each class varies from 40 to 258. Each image has a text description embedded in 1024 dimensions. 60, 20 and 22 classes are used for training, validation and testing. All images in these datasets are resized to 84×84 for fair comparisons.

| Methods | Backbone | 5-way 1-shot | 5-way 5-shot |
| --- | --- | --- | --- |
| MatchingNet [34] | ConvNet4 | 55.72 ± 0.40% | 76.59 ± 0.21% |
| ProtoNet [31] | ConvNet4 | 57.76 ± 0.29% | 79.27 ± 0.19% |
| RelationNet [32] | ConvNet4 | 49.58 ± 0.35% | 76.21 ± 0.19% |
| Comp. [33] | ResNet10 | 45.9% | 67.1% |
| AM3 [41] | ConvNet4 | 62.79 ± 0.32% | 79.69 ± 0.23% |
| AGAM [12] | ConvNet4 | 65.15 ± 0.31% | 80.08 ± 0.21% |
| MAP-Net (Ours) | ConvNet4 | 67.73 ± 0.30% | 80.30 ± 0.21% |

TABLE III: Few-shot classification accuracy on SUN with 95% confidence intervals. Some of the accuracies are those reported in [12].

| Methods | Backbone | 5-way 1-shot | 5-way 5-shot |
| --- | --- | --- | --- |
| ProtoNet [31] | ConvNet4 | 62.81% | 82.11% |
| RelationNet [32] | ConvNet4 | 68.26% | 80.94% |
| AM3 [41] | ConvNet4 | 74.00 ± 0.29% | 88.67 ± 0.18% |
| AGAM [12] | ConvNet4 | 73.27 ± 0.29% | 89.49 ± 0.17% |
| SoSN-ArL [46] | ConvNet4 | 76.21% | 88.39% |
| MAP-Net (Ours) | ConvNet4 | 77.41 ± 0.28% | 90.63 ± 0.16% |

TABLE IV: Few-shot classification accuracy on Flowers with 95% confidence intervals. Some results are from our implementation, and some accuracies are those reported in [46].

| CUB 5-way 1-shot | CUB 5-way 5-shot | SUN 5-way 1-shot | SUN 5-way 5-shot | Flower 5-way 1-shot | Flower 5-way 5-shot |
| --- | --- | --- | --- | --- | --- |
| 75.30 ± 0.25% | 80.28 ± 0.22% | 63.11 ± 0.30% | 78.91 ± 0.21% | 74.78 ± 0.28% | 88.81 ± 0.17% |
| 78.14 ± 0.24% | 81.42 ± 0.22% | 59.12 ± 0.31% | 76.51 ± 0.25% | 74.97 ± 0.28% | 89.76 ± 0.17% |
| 79.53 ± 0.23% | 84.08 ± 0.19% | 66.90 ± 0.29% | 79.82 ± 0.23% | 76.14 ± 0.28% | 90.19 ± 0.17% |
| 80.03 ± 0.23% | 84.52 ± 0.19% | 66.83 ± 0.29% | 79.70 ± 0.23% | 76.28 ± 0.28% | 89.78 ± 0.17% |
| 79.98 ± 0.22% | 85.15 ± 0.19% | 67.35 ± 0.30% | 79.88 ± 0.24% | 76.68 ± 0.29% | 90.38 ± 0.16% |
| 80.92 ± 0.21% | 85.88 ± 0.18% | 67.73 ± 0.30% | 80.30 ± 0.21% | 77.41 ± 0.28% | 90.63 ± 0.16% |

TABLE V: Ablation results of main components. Notations: ‘VP’ - Visual Propagation, ‘SP’ - Semantic Propagation, ‘RG’ - Relation Guidance.

Experimental Settings. We conduct experiments on 5-way 1-shot and 5-way 5-shot settings, and 15 query samples are employed for both meta-training and meta-testing in each episode. We report the average accuracy (%) and the corresponding 95% confidence interval over 5000 episodes. To make a fair comparison, our method is trained in the inductive setting. That is to say, we use only one query sample at a time to build the corresponding graph.
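For reference, a mean accuracy with a 95% confidence interval over episodes is commonly computed with the normal approximation shown below. The source does not state which estimator was used, so the 1.96 factor and the sample standard deviation are assumptions, as are the function name and the toy accuracies.

```python
import math
import statistics

def mean_and_ci95(accuracies):
    """Mean and half-width of the normal-approximation 95% confidence interval."""
    mean = statistics.fmean(accuracies)
    half_width = 1.96 * statistics.stdev(accuracies) / math.sqrt(len(accuracies))
    return mean, half_width

# Hypothetical per-episode accuracies; a real evaluation would use 5000 episodes.
episode_accuracies = [0.8, 0.6, 1.0, 0.8]
mean_acc, ci = mean_and_ci95(episode_accuracies)
```

The half-width shrinks with the square root of the number of episodes, which is why the intervals reported over thousands of episodes are only a few tenths of a percent wide.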

Implementation Details. We utilize two popular convolutional networks, ConvNet-4 [34] and ResNet-12 [4], as backbones to extract the visual features of images, while the semantic embeddings are encoded with a Multi-Layer Perceptron (MLP). In the meta-training stage, we train the model for 60 epochs in both the 1-shot and 5-shot settings, with each epoch consisting of 1000 training episodes and 600 validation episodes. We randomly select K support samples and 15 query samples in each episode. The Adam [17] optimizer is utilized with an initial learning rate of 0.001, which is decayed by a factor of 0.1 every 15 epochs. The batch size is 5 for ConvNet-4 and 1 for ResNet-12. For MAP-Net, the smoothing factor is set to 0.2, and the weight coefficient of the relation transfer loss is 1 for CUB and 0.1 for SUN and Flower.
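The learning-rate schedule can be written as a one-line step decay. This sketch assumes that "reduced by 0.1" means the rate is multiplied by 0.1 at each 15-epoch boundary; the function name and argument names are illustrative.

```python
def learning_rate(epoch, base_lr=0.001, gamma=0.1, step=15):
    """Step decay: multiply the rate by gamma after every `step` epochs."""
    return base_lr * gamma ** (epoch // step)

# Learning rate at the start of each 15-epoch stage of a 60-epoch run.
schedule = [learning_rate(epoch) for epoch in (0, 15, 30, 45)]
```

Under this reading, the rate goes through four stages over the 60 training epochs, dropping by one order of magnitude at epochs 15, 30 and 45.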

IV-B Comparison with State-of-the-art Methods

We choose fourteen popular meta-learning based methods as competitors, including both nonsemantic-based methods and semantic-based methods:

(1) Nonsemantic-based methods.

  • Metric-based methods: Matching Networks [34], Prototypical Networks [31], Relation Networks [32], TADAM [25] and FEAT [44].

  • Optimization-based methods: MAML [6], ARML [43].

  • Data-augmentation-based methods: AFHN [20].

(2) Semantic-based methods.

  • Guiding embeddings with semantics: AM3 [41], Multi-Semantics [30], Comp. [33], SoSN-ArL [46], AGAM [12].

  • Generating embeddings with semantics: Dual TriNet [5].

We compare our MAP-Net with these methods using both ConvNet-4 and ResNet-12 on CUB. Since no previous method reports results with ResNet-12 on SUN and Flower, we compare on these two datasets only with ConvNet-4. Tables II, III and IV show the results on the three benchmark datasets, respectively.

It can be observed that our method outperforms all competing methods on the three datasets with both ConvNet-4 and ResNet-12 backbones. In particular, with the ConvNet-4 backbone, MAP-Net achieves gains over the second-best method of 5.05% and 4.22% on CUB, 2.58% and 0.22% on SUN, and 1.20% and 1.14% on Flower in the 1-shot and 5-shot settings, respectively. With ResNet-12 on CUB, MAP-Net also obtains the best performance in both settings, outperforming the second-best approach AGAM [12] by 2.87% and 1.13%, respectively.

Besides, we have the following three observations. (1) The improvements are more significant in the 1-shot setting than in the 5-shot setting. This suggests that auxiliary semantic information is most beneficial when visual samples are extremely scarce. (2) The improvements with ResNet-12 are smaller than those with ConvNet-4. This suggests that a more powerful feature encoder reduces the benefit of semantic information: when the feature encoder extracts more discriminative embeddings, the balance between visual and semantic modalities is broken, because the information carried by the semantic vectors is finite. (3) The performance improves only slightly in the 5-shot setting on SUN. We suppose the main reason is that the attribute vectors of SUN are so sparse that their effectiveness is suppressed as the visual information increases.

Fig. 4: Comparison of our method and two baseline methods with different number of shots on CUB, SUN, Flower.
Fig. 5: Average value of the learned fusion coefficient for support samples and query samples, respectively, with different numbers of shots on CUB, SUN, Flower.

IV-C Ablation Studies

We conduct ablation studies on the three datasets to verify the effectiveness of the main components in MAP-Net. The results are shown in Table V. We set the model without these components as the baseline, which is similar to AM3 [41]. The only difference is that AM3 combines the visual prototypes with the class semantic embeddings, while our baseline combines support visual features with their semantic embeddings.

It is observed that our method with all components obtains the best performance on the three datasets. Both visual graph propagation and semantic graph propagation are beneficial in MAP-Net, especially the propagation in the semantic modality. For example, compared with the baseline, the visual propagation brings 2.84% and 1.14% improvements in the 1-shot and 5-shot settings on CUB, while the semantic propagation brings 4.23% and 3.80% improvements. For SUN and Flower, the semantic propagation also improves the accuracy by a large margin. This demonstrates that the information asymmetries between support and query samples are reduced significantly with semantic propagation. However, the visual propagation does not bring significant gains on SUN and Flower, and the performance is even degraded on SUN. We suppose the reason is that the information propagated from neighbor nodes in the visual graph might be useless or even harmful, since the relationships among samples are not very accurate in low-data scenarios.

| Methods | Backbone | 5-way 1-shot | 5-way 5-shot |
| --- | --- | --- | --- |
| MAP-Net (w/o guidance) | ConvNet4 | 80.03 ± 0.23% | 84.52 ± 0.19% |
| MAP-Net (w IC) | ConvNet4 | 78.62 ± 0.24% | 82.50 ± 0.21% |
| MAP-Net (w RC) | ConvNet4 | 79.13 ± 0.24% | 82.65 ± 0.21% |
| MAP-Net (w RG) | ConvNet4 | 80.92 ± 0.21% | 85.88 ± 0.17% |

TABLE VI: Results of different guidance methods on CUB. Notations: ‘IC’ - Instance Constraint, ‘RC’ - Relation Constraint, ‘RG’ - Relation Guidance.

For the same reason, the performance is also limited when the visual propagation and the semantic propagation are used simultaneously. By contrast, this issue is alleviated by the Relation Guidance strategy, which brings an improvement of about 1% in all settings. This indicates that, guided by semantic embeddings with an accurate distribution, the rectified relation map represents the relationships among samples more correctly and thus propagates more discriminative pseudo-semantic embeddings. Moreover, the adverse impact of visual propagation is also reduced.
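To make the role of the visual relationships concrete, the following is a minimal sketch of generating a pseudo-semantic vector for a query sample. It assumes a single propagation step in which the weights are a softmax over negative squared Euclidean distances in the visual space; the function names, the `temperature` parameter, and the toy data are hypothetical, and the actual MAP-Net propagation defined earlier in the paper may differ.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def propagate_semantics(query_visual, support_visual, support_semantic, temperature=1.0):
    """Pseudo-semantic vector for one query: a similarity-weighted average of
    the support semantics (hypothetical one-step propagation, not the paper's
    exact rule)."""
    # Negative squared Euclidean distance in the visual space is the edge score.
    scores = [
        -sum((q - s) ** 2 for q, s in zip(query_visual, sv)) / temperature
        for sv in support_visual
    ]
    weights = softmax(scores)
    dim = len(support_semantic[0])
    return [
        sum(w * sem[d] for w, sem in zip(weights, support_semantic))
        for d in range(dim)
    ]

# Toy episode: two support samples with one-hot "attribute" vectors.
support_visual = [[0.0, 0.0], [4.0, 4.0]]
support_semantic = [[1.0, 0.0], [0.0, 1.0]]

# A query close to the first support sample inherits its semantics.
near_first = propagate_semantics([0.1, 0.0], support_visual, support_semantic)
# A query equidistant from both supports receives an uninformative average.
ambiguous = propagate_semantics([2.0, 2.0], support_visual, support_semantic)
```

In this toy setting, the equidistant query obtains equal weights on both support samples, which illustrates why inaccurate or ambiguous visual relationships can yield pseudo-semantics that carry little discriminative information.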

IV-D Further Analysis

Impact of Different Guidance Methods. To validate the effectiveness of the Relation Guidance strategy, we design two alternative methods that guide the visual embeddings with semantic information. In this section, we compare the results on CUB with three guidance methods: Instance Constraint (IC), which directly constrains every cross-modal representation of the support samples; Relation Constraint (RC), which constrains the corresponding relation vectors of support-support pairs in the two modalities; and our proposed Relation Guidance (RG) method. The results are reported in Table VI.

We observe that both IC and RC hurt the performance, which indicates that directly constraining cross-modal embeddings is inappropriate in few-shot learning. By contrast, MAP-Net with RG achieves promising performance. This demonstrates that the information propagated among nodes is more accurate with the Relation Guidance strategy, and that relative information is more effective for alleviating the cross-modal discrepancy.
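The comparison can be made explicit by computing each guidance method's change relative to the unguided model. The short sketch below copies the mean accuracies from Table VI; the dictionary layout and the helper name `gains_over_reference` are illustrative choices only.

```python
# Mean accuracies (%) copied from Table VI (CUB, ConvNet4, 5-way).
table_vi = {
    "w/o guidance": (80.03, 84.52),
    "IC": (78.62, 82.50),
    "RC": (79.13, 82.65),
    "RG": (80.92, 85.88),
}

def gains_over_reference(results, reference="w/o guidance"):
    """Return (1-shot, 5-shot) accuracy differences against the reference row."""
    ref_1shot, ref_5shot = results[reference]
    return {
        name: (round(acc1 - ref_1shot, 2), round(acc5 - ref_5shot, 2))
        for name, (acc1, acc5) in results.items()
        if name != reference
    }

gains = gains_over_reference(table_vi)
for name, (d1, d5) in gains.items():
    print(f"{name}: 1-shot {d1:+.2f}, 5-shot {d5:+.2f}")
```

Only RG yields positive differences (+0.89 and +1.36 points), whereas IC and RC both fall below the unguided model in the 1-shot and 5-shot settings.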

Fig. 6: The t-SNE visualization of the embeddings learned by ProtoNet, AM3, MAP-Net on CUB respectively.

Analysis of Information Asymmetries. Figure 4 shows the accuracies of our method and two baseline methods (AM3 and ProtoNet) on three datasets in the 1-shot to 10-shot settings. The improvement of AM3 over ProtoNet shrinks as the number of shots increases, and ProtoNet even surpasses AM3 in the 10-shot setting on CUB and Flower. By contrast, our method outperforms both baselines regardless of the number of shots. We interpret this result in two ways. First, semantic information is most essential when visual samples are extremely scarce (e.g., 1-shot), while abundant visual information (e.g., 10-shot) can suppress the effectiveness of the semantics. Second, more supervision yields more accurate relationships among samples, which helps generate the pseudo-semantic embeddings of query samples and thereby reduces the information asymmetries. Therefore, our method continues to improve the performance as the number of shots increases.

Figure 5 shows the relationship between the weight coefficient and the number of shots. The coefficient grows as the number of shots increases, consistent with the analysis in [41]. Moreover, the coefficient for query samples is larger than that for support samples. We consider the reason to be that the pseudo-semantic embeddings of query samples, propagated from adjacent support samples, cannot fully represent the real semantic information, although they supplement the missing information to some extent; this is worthy of further study. We can also observe that as the number of shots increases, the gap between the coefficient values for support samples and query samples decreases. This means that the pseudo-semantics of query samples and the real semantics of support samples become more symmetric when richer relationship information is constructed from more support samples, which also supports the effectiveness of modal-alternating propagation.

The differences among the three datasets reflect how representative the semantic information is of the corresponding samples in each dataset. The semantic information used in CUB is more representative, since the weight on its semantic embeddings is large (almost 0.9), while the weights on SUN and Flower are relatively small.
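As a rough illustration of how such a weight trades off the two modalities, the sketch below forms a convex combination of a visual embedding and a semantic embedding. It is parameterized by a hypothetical `semantic_weight` in [0, 1]; how this quantity maps to the coefficient plotted in Fig. 5, and how the coefficient is predicted by the network, are assumptions that are not taken from this section.

```python
def combine(visual, semantic, semantic_weight):
    """Convex combination of two same-length embeddings.

    Assumption: `semantic_weight` multiplies the semantic embedding and
    (1 - semantic_weight) multiplies the visual one; the exact definition
    used by MAP-Net may differ.
    """
    if not 0.0 <= semantic_weight <= 1.0:
        raise ValueError("semantic_weight must lie in [0, 1]")
    if len(visual) != len(semantic):
        raise ValueError("embeddings must have the same length")
    return [
        (1.0 - semantic_weight) * v + semantic_weight * s
        for v, s in zip(visual, semantic)
    ]

# Toy vectors: with a weight near 0.9, the result is dominated by the semantics.
mixed = combine([1.0, 0.0], [0.0, 1.0], semantic_weight=0.9)
```

With a weight of about 0.9, as reported for CUB, the combined embedding lies close to the semantic embedding, whereas smaller weights leave it closer to the visual one.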

Visualization Analysis. To further examine the effectiveness of our method, we visualize the embedding space on the CUB dataset with t-SNE. As shown in Fig. 6, the embeddings obtained from AM3 and MAP-Net, which use auxiliary semantic information, are more separable than those from Prototypical Networks (ProtoNet). For ProtoNet, one prototype obtained by averaging support samples is even far from the corresponding query samples, which severely affects the classification results. Moreover, although the clusters of AM3 are relatively scattered, there is a large distance between the query embeddings and their prototypes, which is caused by the information asymmetries arising from the lack of query semantic information, and two prototypes even overlap with each other. By contrast, with the generated pseudo-semantic embeddings, the query embeddings of MAP-Net are more consistent with their prototypes.
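The distance between query embeddings and their prototypes, which is judged visually above, could also be measured numerically. The following is a minimal sketch under the assumption that a prototype is the mean of the support embeddings of a class and that Euclidean distance is used; the helper names and toy vectors are hypothetical, and this measurement is not reported in the paper.

```python
import math

def prototype(embeddings):
    """Class prototype: the element-wise mean of the support embeddings."""
    n = len(embeddings)
    return [sum(column) / n for column in zip(*embeddings)]

def mean_query_distance(support, queries):
    """Average Euclidean distance from each query to its own class prototype."""
    proto = prototype(support)
    return sum(math.dist(q, proto) for q in queries) / len(queries)

# Toy class: the prototype of the two support embeddings is [1.0, 0.0].
support = [[0.0, 0.0], [2.0, 0.0]]
queries = [[1.0, 3.0], [1.0, 5.0]]
gap = mean_query_distance(support, queries)
```

A smaller value for a given method would correspond to query embeddings that lie closer to their prototypes, which is the property attributed to MAP-Net in Fig. 6.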

V Conclusion

To address the information asymmetry problem that arises when auxiliary semantic information is used in few-shot learning, we have proposed a Modal-Alternating Propagation Network (MAP-Net) to supplement the missing query semantics. MAP-Net generates the query semantics and rectifies the feature embeddings by alternately propagating information over graphs in the two modalities. Furthermore, the proposed flexible Relation Guidance (RG) strategy significantly reduces the large discrepancy between the modalities, so that the relation vectors, guided by semantics, represent the true relationships among samples more accurately. Extensive experiments on three datasets with semantic information have demonstrated the effectiveness of the proposed MAP-Net.


  • [1] S. Baik, S. Hong, and K. M. Lee (2020) Learning to forget for meta-learning. In CVPR, pp. 2379–2387. Cited by: §II-A.
  • [2] H. Chen, G. Ding, X. Liu, Z. Lin, J. Liu, and J. Han (2020) IMRAM: iterative matching with recurrent attention memory for cross-modal image-text retrieval. In CVPR, pp. 12655–12663. Cited by: §II-B.
  • [3] W. Chen, Y. Liu, Z. Kira, Y. F. Wang, and J. Huang (2019) A closer look at few-shot classification. In ICLR, pp. 1–16. Cited by: §I, §IV-A.
  • [4] Y. Chen, X. Wang, Z. Liu, H. Xu, and T. Darrell (2020) A new meta-baseline for few-shot learning. arXiv preprint arXiv:2003.04390. Cited by: §IV-A.
  • [5] Z. Chen, Y. Fu, Y. Zhang, Y. G. Jiang, X. Xue, and L. Sigal (2019) Multi-level semantic feature augmentation for one-shot learning. IEEE Transactions on Image Processing 28 (9), pp. 4594–4605. Cited by: §I, §II-B, TABLE II, 2nd item.
  • [6] C. Finn, P. Abbeel, and S. Levine (2017) Model-agnostic meta-learning for fast adaptation of deep networks. In ICML, pp. 1126–1135. Cited by: §II-A, TABLE II, 2nd item.
  • [7] A. Frome, G. Corrado, J. Shlens, S. Bengio, J. Dean, M. A. Ranzato, and T. Mikolov (2013) DeViSe: a deep visual-semantic embedding model. In NeurIPS, pp. 2121–2129. Cited by: §II-B.
  • [8] R. Gao, X. Hou, J. Qin, J. Chen, L. Liu, F. Zhu, Z. Zhang, and L. Shao (2020) Zero-VAE-GAN: generating unseen features for generalized and transductive zero-shot learning. IEEE Transactions on Image Processing 29, pp. 3665–3680. Cited by: §II-B.
  • [9] V. Garcia and J. Bruna (2018) Few-shot learning with graph neural networks. In ICLR, pp. 1–13. Cited by: §II-C.
  • [10] J. Guan, Z. Lu, T. Xiang, A. Li, A. Zhao, and J. R. Wen (2020) Zero and few shot learning with semantic feature synthesis and competitive learning. IEEE Transactions on Pattern Analysis and Machine Intelligence 43 (7), pp. 2510–2523. Cited by: §II-B.
  • [11] T. M. Hospedales, A. Antoniou, P. Micaelli, and A. J. Storkey (2021, Online) Meta-learning in neural networks: a survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–20. Cited by: §I.
  • [12] S. Huang, M. Zhang, Y. Kang, and D. Wang (2021) Attributes-guided and pure-visual attention alignment for few-shot recognition. In AAAI, pp. 7840–7847. Cited by: §I, §II-B, TABLE II, 1st item, §IV-A, §IV-B, TABLE III, TABLE IV.
  • [13] R. Jackendoff (1987) On beyond zebra: the relation of linguistic and visual information. Cognition 26 (2), pp. 89–114. Cited by: §I.
  • [14] Z. Ji, X. Liu, Y. Pang, W. Ouyang, and X. Li (2021) Few-shot human-object interaction recognition with semantic-guided attentive prototypes network. IEEE Transactions on Image Processing 30, pp. 1648–1661. Cited by: §II-B.
  • [15] Z. Ji, Y. Sun, Y. Yu, Y. Pang, and J. Han (2020) Attribute-guided network for cross-modal zero-shot hashing. IEEE Transactions on Neural Networks and Learning Systems 31 (1), pp. 321–330. Cited by: §II-B.
  • [16] J. Kim, T. Kim, S. Kim, and C. D. Yoo (2019) Edge-labeling graph neural network for few-shot learning. In CVPR, pp. 11–20. Cited by: §II-C, §II-C.
  • [17] D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. In ICLR, pp. 1–15. Cited by: §IV-A.
  • [18] R. Kosti, J. M. Alvarez, A. Recasens, and A. Lapedriza (2017) Emotion recognition in context. In CVPR, pp. 1667–1675. Cited by: §II-B.
  • [19] A. Li, W. Huang, X. Lan, J. Feng, Z. Li, and L. Wang (2020) Boosting few-shot learning with adaptive margin loss. In CVPR, pp. 12576–12584. Cited by: §I.
  • [20] K. Li, Y. Zhang, K. Li, and Y. Fu (2020) Adversarial feature hallucination networks for few-shot learning. In CVPR, pp. 13470–13479. Cited by: §II-A, TABLE II, 3rd item.
  • [21] L. Liu, T. Zhou, G. Long, J. Jiang, and C. Zhang (2020) Attribute propagation network for graph zero-shot learning. In AAAI, pp. 4868–4875. Cited by: §II-C.
  • [22] Y. Liu, J. Lee, M. Park, S. Kim, E. Yang, S. J. Hwang, and Y. Yang (2019) Learning to propagate labels: transductive propagation network for few-shot learning. In ICLR, pp. 1–14. Cited by: §II-C, §III-C.
  • [23] A. Nichol, J. Achiam, and J. Schulman (2018) On first-order meta-learning algorithms. arXiv preprint arXiv:1803.02999. Cited by: §II-A.
  • [24] M. E. Nilsback and A. Zisserman (2008) Automated flower classification over a large number of classes. In ICVGIP, pp. 722–729. Cited by: §IV-A, TABLE I.
  • [25] B. N. Oreshkin, P. R. Lopez, and A. Lacoste (2018) TADAM: task dependent adaptive metric for improved few-shot learning. In NeurIPS, pp. 721–731. Cited by: TABLE II, 1st item.
  • [26] G. Patterson, C. Xu, H. Su, and J. Hays (2014) The SUN attribute database: beyond categories for deeper scene understanding. International Journal of Computer Vision 108 (1-2), pp. 59–81. Cited by: §IV-A, TABLE I.
  • [27] P. Rodriguez, I. Laradji, A. Drouin, and A. Lacoste (2020) Embedding propagation: smoother manifold for few-shot classification. In ECCV, pp. 121–138. Cited by: §II-C, §III-C.
  • [28] A. A. Rusu, D. Rao, J. Sygnowski, O. Vinyals, R. Pascanu, S. Osindero, and R. Hadsell (2019) Meta-learning with latent embedding optimization. In ICLR, pp. 1–17. Cited by: §II-A.
  • [29] E. Schonfeld, S. Ebrahimi, S. Sinha, T. Darrell, and Z. Akata (2019) Generalized zero-and few-shot learning via aligned variational autoencoders. In CVPR, pp. 8247–8255. Cited by: §I.
  • [30] E. Schwartz, L. Karlinsky, R. Feris, R. Giryes, and A. M. Bronstein (2019) Baby steps towards few-shot learning with multiple semantics. arXiv preprint arXiv:1906.01905. Cited by: §II-B, TABLE II, 1st item.
  • [31] J. Snell, K. Swersky, and R. S. Zemel (2017) Prototypical networks for few-shot learning. In NeurIPS, pp. 4077–4087. Cited by: §I, §II-A, §III-D, TABLE II, 1st item, TABLE III, TABLE IV.
  • [32] F. Sung, Y. Yang, L. Zhang, T. Xiang, P. H. S. Torr, and T. M. Hospedales (2018) Learning to compare: relation network for few-shot learning. In CVPR, pp. 1199–1208. Cited by: §I, §II-A, TABLE II, 1st item, TABLE III, TABLE IV.
  • [33] P. Tokmakov, Y. Wang, and M. Hebert (2019) Learning compositional representations for few-shot recognition. In ICCV, pp. 6372–6381. Cited by: §I, §II-B, TABLE II, 1st item, TABLE III.
  • [34] O. Vinyals, C. Blundell, T. P. Lillicrap, K. Kavukcuoglu, and D. Wierstra (2016) Matching networks for one-shot learning. In NeurIPS, pp. 3637–3645. Cited by: §I, §II-A, §III-A, TABLE II, 1st item, §IV-A, TABLE III.
  • [35] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie (2011) The Caltech-UCSD Birds-200-2011 dataset. Technical Report CNS-TR-2011-001, California Institute of Technology. Cited by: §IV-A, TABLE I.
  • [36] G. Wallis and H. Bulthoff (1999) Learning to recognize objects. Trends in Cognitive Sciences 3 (1), pp. 22–31. Cited by: §I.
  • [37] H. Wang, Y. Zhang, Z. Ji, Y. Pang, and L. Ma (2020) Consensus-aware visual-semantic embedding for image-text matching. In ECCV, pp. 18–34. Cited by: §II-B.
  • [38] J. Wang, S. Xu, F. Zheng, K. Lu, J. Song, and L. Shao (2021) Learning efficient hash codes for fast graph-based data similarity retrieval. IEEE Transactions on Image Processing, pp. 1–14. Cited by: §I.
  • [39] Y. Wang, L. Zhang, Y. Yao, and Y. Fu (2021, Online) How to trust unlabeled data instance credibility inference for few-shot learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–14. Cited by: §I.
  • [40] Y. Wang, R. B. Girshick, M. Hebert, and B. Hariharan (2018) Low-shot learning from imaginary data. In CVPR, pp. 7278–7286. Cited by: §II-A.
  • [41] C. Xing, N. Rostamzadeh, B. Oreshkin, and O. P. Pinheiro (2019) Adaptive cross-modal few-shot learning. In NeurIPS, pp. 4847–4857. Cited by: §I, §II-B, §III-B, TABLE II, 1st item, §IV-C, §IV-D, TABLE III, TABLE IV.
  • [42] L. Yang, L. Li, Z. Zhang, X. Zhou, E. Zhou, and Y. Liu (2020) DPGN: distribution propagation graph network for few-shot learning. In CVPR, pp. 13390–13399. Cited by: §II-C.
  • [43] H. Yao, X. Wu, Z. Tao, Y. Li, B. Ding, R. Li, and Z. Li (2020) Automated relational meta-learning. In ICLR, pp. 1–19. Cited by: TABLE II, 2nd item.
  • [44] H. J. Ye, H. Hu, D. C. Zhan, and F. Sha (2020) Few-shot learning via embedding adaptation with set-to-set functions. In CVPR, pp. 8808–8817. Cited by: TABLE II, 1st item.
  • [45] C. Zhang, D. Song, C. Huang, A. Swami, and N. V. Chawla (2019) Heterogeneous graph neural network. In ACM SIGKDD, pp. 793–803. Cited by: §II-C.
  • [46] H. Zhang, P. Koniusz, S. Jian, H. Li, and P. H. S. Torr (2021) Rethinking class relations: absolute-relative supervised and unsupervised few-shot learning. In CVPR, pp. 9432–9441. Cited by: §II-B, TABLE II, 1st item, TABLE IV.
  • [47] H. Zhang, J. Zhang, and P. Koniusz (2019) Few-shot learning via saliency-guided hallucination of samples. In CVPR, pp. 2770–2779. Cited by: §II-A.
  • [48] M. Zhang and Y. Chen (2018) Link prediction based on graph neural networks. NeurIPS 31, pp. 5165–5175. Cited by: §II-C.
  • [49] T. Zhang, W. Zheng, Z. Cui, Y. Zong, and Y. Li (2019) Spatial–temporal recurrent neural network for emotion recognition. IEEE Transactions on Cybernetics 49 (3), pp. 839–847. Cited by: §II-B.
  • [50] D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Scholkopf (2004) Learning with local and global consistency. In NeurIPS, pp. 321–328. Cited by: §II-C.
  • [51] H. Zhu, L. Li, J. Wu, S. Zhao, G. Ding, and G. Shi (2020, Online) Personalized image aesthetics assessment via meta-learning with bilevel gradient optimization. IEEE Transactions on Cybernetics, pp. 1–14. Cited by: §I.