Deep learning has become the dominant technology for computer vision tasks. However, its performance is severely limited by the amount of labeled data, which is difficult or even infeasible to acquire due to high annotation cost or the scarcity of rare categories. By contrast, humans can recognize a new object from only one or a few observations, based on abundant prior knowledge learned before. Inspired by this, learning to recognize new classes with few samples, called Few-Shot Learning (FSL) [3, 39], has attracted great attention recently.
One intuitive solution for FSL is to employ the experience learned from other similar tasks. Meta-learning [11, 51], also known as “learning to learn”, aims at learning new concepts or skills rapidly with a few training examples based on abundant prior knowledge learned from base classes. Thus, the meta-learning framework has been widely employed on FSL and achieved promising performance [34, 31, 32].
Many studies demonstrate that humans capture object concepts not only from the visual view but also from language describing the characteristics of the objects [13, 36]. Thus, some FSL approaches [41, 5, 19, 12] utilize auxiliary semantic information, i.e., word embeddings or attribute annotations, to enhance the feature representations and improve performance. For example, Li et al. modified the visual embeddings according to the relationship between the visual distance and the semantic similarity of different categories. Huang et al. employed an attribute-guided attention mechanism to augment the representations with the guidance of semantics. However, since the semantic information of query samples is unavailable, using raw query embeddings alongside support embeddings enhanced with semantics inevitably produces a cross-modal embedding bias, which may lead to an information asymmetry problem, as shown in Figure 1(b).
To address this problem, we introduce a graph propagation model to obtain the query semantics via the relation information completed in the visual modality. The graph structure has natural advantages in modeling relationships among nodes, making it effective for propagating information from one node to another. The main idea of our method is to update the visual graph with the guidance of semantics, and then propagate the semantic graph with the information transferred from the visual modality. With the alternating propagation in two modalities, the information asymmetry problem is alleviated significantly as the visual graph is rectified and the semantic graph is completed.
Considering the large discrepancy across modalities, it is essential to reduce the cross-modal shift so that the information in the two modalities is mutually beneficial. A common approach is to constrain the embeddings of the two modalities in a shared latent space with a penalty function. For example, Schonfeld et al. learned shared cross-modal feature embeddings of the visual and semantic modalities with a Variational Auto-Encoder (VAE) and aligned the embeddings with two elaborate loss functions. Tokmakov et al. designed a soft constraint regularization that improves the robustness of the alignment. However, directly constraining the instance embeddings is inappropriate in FSL, since it is difficult to maintain a balance between extracting discriminative features and aligning cross-modal embeddings in a low-data scenario. To this end, we focus on the relations among samples and design a new guidance strategy that is flexible enough to reduce the cross-modal discrepancy.
Specifically, we propose a Modal-Alternating Propagation Network (MAP-Net) that obtains the semantic information of query samples and rectifies the feature embeddings, alleviating the information asymmetry problem in FSL. MAP-Net constructs two graphs, i.e., the visual graph and the semantic graph, which are propagated alternately with the guidance of the other modality. The semantic graph is incomplete since the semantic embeddings of query samples are unavailable. Therefore, we transfer the relation information from the visual modality to the semantic modality, which is essential for propagating and completing the semantic graph. After the propagation, the information asymmetries are mitigated significantly as the query semantics are generated. To reduce the cross-modal discrepancy, we further propose a Relation Guidance (RG) strategy to modify the relationships in the visual modality: the relation vectors are transferred with a relation transfer module, trained on support-support pairs, to obtain the rectified relationships.
Our contributions are summarized as follows:
We propose a Modal-Alternating Propagation Network to propagate modal information alternately to generate the pseudo-semantics of query samples and rectify the feature embeddings, which is effective in alleviating the information asymmetry problem between support and query samples.
To overcome the discrepancy between modalities and obtain the accurate relation information among different samples, we propose a Relation Guidance strategy to guide visual relationships with the relationships in semantic modality. The visual relation vectors are transferred with a relation transfer module trained with support-support pairs to represent the relationships more accurately.
We conduct experiments on three benchmark datasets with attributes or text descriptions, i.e., Caltech-UCSD-Birds 200-2011, SUN Attribute Database and Oxford 102 Flower, to compare the proposed method with previous few-shot learning methods. The experimental results demonstrate that our method achieves promising performance for few-shot learning from the perspective of information symmetry.
II Related Work
II-A Few-Shot Learning
Few-Shot Learning aims at learning novel concepts with only one or a few samples, and has been widely studied in recent years. Most existing few-shot methods follow the meta-learning strategy, also known as learning to learn, to transfer prior knowledge obtained from a large amount of auxiliary data in the meta-training phase to novel tasks. Meta-learning-based methods can generally be divided into three types: optimization-based, metric-based and data-augmentation-based methods.
The optimization-based methods learn initial parameters that can be quickly adapted to novel tasks with only a few steps of gradient descent. MAML  is the first optimization-based method, utilizing a second-order optimization strategy within the meta-learning framework to quickly update the parameters. To simplify the optimization, Nichol et al.  utilized a first-order method to replace the second-order derivation in MAML. Rusu et al.  updated parameters in a low-dimensional latent space, which is more practical in low-data scenarios. Since a shared initialization may lead to conflicts across tasks, Baik et al.  proposed a task-and-layer-wise attenuation to selectively forget prior information.
In the metric-based methods, an embedding space is constructed to measure the similarities among feature embeddings. Their simplicity and efficiency make metric-based methods highly attractive in few-shot learning. Matching Networks  utilize an LSTM-based attention mechanism to learn to classify novel samples. Snell et al.  proposed Prototypical Networks to measure the distance between each sample and the prototype of its corresponding class. Relation Network  utilizes a learnable metric instead of a manual measurement, which is more flexible for metric-based classification.
The main idea of the data-augmentation-based methods is to alleviate the lack of labeled data through data augmentation. Wang et al.  generated samples with a GAN to expand the diversity of data. Zhang et al.  utilized a saliency detection method that fuses the foregrounds and backgrounds of different images to augment samples. To alleviate the mode collapse problem of GANs, Li et al.  employed a cWGAN in few-shot learning to ensure the diversity of the generated data.
II-B Learning with Semantic Information
Semantic information usually plays a crucial role in various tasks, such as image-text matching [37, 2], emotion recognition [18, 49] and zero-shot learning [7, 15]. When samples in the visual modality are scarce, semantic information is a good choice to assist model training. Many zero-shot learning methods align visual and semantic representations to classify novel classes without labeled samples. For example, Frome et al.  presented a deep visual-semantic embedding model that learns the semantic relationships among classes from semantic data and maps visual samples into a semantic space to be classified. Ji et al.  employed a semantic embedding space to transfer knowledge from the seen domain to the unseen domain with an attribute-guided network for cross-modal zero-shot hashing retrieval. Considering the potential bias between seen and unseen classes, Gao et al.  utilized a joint generative model to generate high-quality unseen features, further augmented with a self-training strategy. Guan et al.  learned a robust cross-modal projection by synthesizing unseen-class data and designing a novel projection learning model to best exploit the synthesized data.
Building on the success of zero-shot learning, several methods utilizing auxiliary semantic information have been proposed to boost few-shot learning in recent years. Xing et al.  integrated semantic embeddings into visual features with an adaptive convex combination to assist classification. Similarly, SAP-Net  refines the embeddings under the guidance of semantic information. Schwartz et al.  further improved few-shot learning with multiple semantics. To handle the discrepancy between the visual and semantic modalities, Tokmakov et al.  designed a semantic-based soft constraint regularization to learn compositional representations of samples. Chen et al.  synthesized sample features in a semantic space with an encoder-decoder framework to increase the diversity of feature embeddings. Huang et al.  employed an attention mechanism to emphasize or suppress representations under the guidance of semantic information. Zhang et al.  addressed few-shot classification from both relative and absolute views, utilizing semantic information and class labels simultaneously to represent the similarities among samples and the absolute concepts of instances.
II-C Propagation with Graph Model
Graph models are effective at modeling the relationships among nodes and have received great attention in recent years. Thanks to the great expressive power of graphs, especially the convincing performance of deep-learning-based Graph Neural Networks (GNNs), graph-based methods have been employed in various tasks, e.g., node classification , link prediction  and clustering .
Accordingly, various graph-based methods have been proposed for few-shot learning to extract relationships and rectify the graph within an episode. To further explore the potential of graph models in few-shot learning, Kim et al.  proposed EGNN to dynamically update both nodes and edges, and Yang et al.  constructed both distribution-level and instance-level relations with DPGN. Moreover, Label Propagation (LP)  is a classical method that transfers knowledge from the neighbors of each node and has proved effective for few-shot classification within the meta-learning framework. TPN  is the first method utilizing label propagation in few-shot learning. Subsequently, Rodriguez et al.  employed an embedding propagation method to yield a smoother embedding manifold. Similarly, in zero-shot learning, Liu et al.  optimized the semantic space with an Attribute Propagation Network to refine the attributes of each class. Inspired by this, our method, called Modal-Alternating Propagation Network (MAP-Net), aims at propagating information in both the visual and semantic spaces with cross-modal assistance.
We follow the episodic training paradigm as in  for few-shot learning. In general, our model is trained under the $N$-way $K$-shot setting. Each episode consists of $N$ categories sampled from the meta-training set and is divided into two parts: labeled samples in the support set and several query samples in the query set, where the semantic information is available only for the support set. The support set contains $N \times K$ labeled samples together with their semantic vectors and is denoted as $\mathcal{S} = \{(x_i, a_i, y_i)\}_{i=1}^{N \times K}$, where $x_i$, $a_i$ and $y_i$ represent the $i$-th image, its semantic vector and the corresponding label, respectively. The query set $\mathcal{Q}$ contains samples with no semantic vectors. The episodic paradigm aims at obtaining optimal performance on the query set by training the model with the support set.
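The episodic setup described above can be sketched as follows. This is an illustrative sketch, not the paper's code; the function name and the toy dataset layout (a mapping from class to sample ids) are assumptions.

```python
import random

def sample_episode(dataset, n_way=5, k_shot=1, n_query=15, seed=0):
    """Sample one N-way K-shot episode from a class -> sample-ids map."""
    rng = random.Random(seed)
    classes = rng.sample(sorted(dataset), n_way)
    support, query = [], []
    for label, c in enumerate(classes):
        ids = rng.sample(dataset[c], k_shot + n_query)
        # support samples come with semantics; query semantics are withheld
        support += [(i, label) for i in ids[:k_shot]]
        query += [(i, label) for i in ids[k_shot:]]
    return support, query

# toy meta-training set: 10 classes with 20 samples each
data = {c: [f"{c}_{j}" for j in range(20)] for c in range(10)}
s, q = sample_episode(data)
print(len(s), len(q))  # 5 support samples, 75 query samples
```

With `n_way=5` and `k_shot=1`, each episode yields 5 support samples and 5 × 15 = 75 query samples, matching the 5-way 1-shot setting used in the experiments.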
In this work, we propose a Modal-Alternating Propagation Network (MAP-Net) for few-shot learning that rectifies the feature embeddings by constructing information symmetry. Figure 2 presents the main framework of MAP-Net, which consists of a modal-alternating propagation module (MAP-Module) and a feature fusion module. In the MAP-Module, graphs are constructed in both the visual and semantic modalities and learn to propagate information alternately to obtain the semantic vectors of query samples. Concretely, we first modify the visual embeddings with the guidance of semantics to reduce the high intra-class variance in the visual modality. Then the semantic graph is completed by applying the information transferred from the visual space. We employ a Relation Guidance (RG) strategy to guide the visual embeddings: rather than simply aligning the two modal embeddings with a penalty function, we guide the relation information among samples in the visual modality with that in the semantic modality. With the guidance of the relation map, both the visual graph and the semantic graph are updated to obtain symmetric embeddings of the support and query samples. Finally, the propagated visual and semantic embeddings are fused into augmented embeddings by a convex combination and employed to classify samples into the corresponding categories.
III-C Modal-Alternating Propagation
Since the categories of query samples are unknown at test time, their semantic information is unavailable, which leads to an information asymmetry between support and query samples. To reduce the bias caused by the lack of query semantics, we design a Modal-Alternating Propagation Module (MAP-Module) to generate pseudo-semantic embeddings by updating the graphs with a propagation operation.
Graph Construction. There are two types of graphs in the MAP-Module, the visual graph and the semantic graph, which transfer and propagate information to generate the pseudo query semantics. In the visual graph $G^v$, each node is the feature embedding obtained from a feature encoder (CNN), which can be denoted as

$$v_i = f_\theta(x_i),$$

where $f_\theta$ represents the feature encoder. For the semantic graph $G^s$, we utilize the attributes or text descriptions of samples as semantic information. Each node is likewise an embedding of the semantic information, encoded by a semantic encoder; the difference is that the query semantic embeddings are initialized as zero vectors. The semantic node $s_i$ is defined as:

$$s_i = \begin{cases} g_\phi(a_i), & x_i \in \mathcal{S}, \\ \mathbf{0}, & x_i \in \mathcal{Q}, \end{cases}$$
where $g_\phi$ is a multi-layer perceptron (MLP) used to encode the semantic vectors. The adjacency matrix $A^v$ of $G^v$ is then obtained from the similarities among nodes. In this paper, we employ the Gaussian similarity function to calculate the graph edges,

$$A^v_{ij} = \exp\left(-\frac{d(v_i, v_j)^2}{2\sigma^2}\right),$$

where $d(v_i, v_j)$ is the Euclidean distance between two neighbor nodes and $\sigma$ is a scaling factor. In particular, we set the diagonal values $A^v_{ii} = 0$ to avoid self-reinforcement, and we set $\sigma$ to the standard deviation of the distance matrix as in .
The adjacency matrix of the semantic graph is constructed in a similar way. However, the query semantic vectors are unavailable to provide the relationships among different nodes in the semantic graph. To address this problem, we transfer the relation information from the visual graph to the semantic graph to acquire the completed adjacency matrix $\hat{A}$, which can then be applied to propagate information and generate the pseudo-semantic embeddings of the query samples.
Relation Guidance. A straightforward way to share relation information across modalities is to directly regard $A^v$ as the adjacency matrix of the semantic graph. However, the visual feature embeddings may fail to represent the corresponding samples correctly due to unrelated information in images (e.g., background), while the semantic vectors are easier to discriminate. Therefore, $A^v$ is inappropriate for describing the relationships among samples, especially when the samples in each episode are so scarce that the estimated distribution is inaccurate. To obtain an appropriate adjacency matrix, we propose a Relation Guidance (RG) strategy to modify the relationships among visual samples with the guidance of the support semantic vectors.
Specifically, we first obtain the relation maps $R^v$ and $R^s$ of the two modalities. Each position of a relation map is a relation vector representing the difference between two samples, which is calculated as follows:

$$r^v_{ij} = v_i - v_j, \qquad r^s_{ij} = s_i - s_j.$$
The relation map can be denoted by four parts as follows:

$$R = \begin{bmatrix} R_{ss} & R_{sq} \\ R_{qs} & R_{qq} \end{bmatrix},$$

where, in the semantic modality, the relation vectors of the query-support and query-query pairs are 0. Thus, RG is designed to employ the known parts to infer the unknown parts. Its core idea is to utilize the relations of support-support pairs in the semantic graph as guidance to transfer $R^v$ into a rectified relation map that represents the relationships among different samples more accurately. Concretely, we take the visual support-support relation vectors $R^v_{ss}$ as training samples and the semantic relation vectors $R^s_{ss}$ as the corresponding targets to train the relation transfer module $T_\omega$ with the mean square error (MSE) loss:

$$\mathcal{L}_{rt} = \frac{1}{|\mathcal{S}|^2} \sum_{i, j \in \mathcal{S}} \left\| T_\omega(r^v_{ij}) - r^s_{ij} \right\|_2^2 .$$

Then all relation vectors in $R^v$ are fed into the RG module to obtain the modified relation vectors:

$$\hat{r}_{ij} = T_\omega(r^v_{ij}).$$
Then the rectified adjacency matrix can be acquired as follows:

$$\hat{A}_{ij} = \exp\left(-\frac{\hat{d}(i, j)^2}{2\sigma^2}\right),$$

where $\hat{d}(i, j) = \|\hat{r}_{ij}\|_2$ represents the distance between samples $i$ and $j$, i.e., the $\ell_2$-norm of the rectified relation vector.
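The relation-transfer idea can be sketched numerically as follows. This is an assumption-laden illustration: a least-squares linear map stands in for the relation transfer module (an MLP in the paper), and all dimensions and data are synthetic.

```python
import numpy as np

rng = np.random.RandomState(0)
ns, nq, dv, ds = 5, 3, 8, 6        # support/query counts, feature dims
V = rng.randn(ns + nq, dv)         # visual embeddings (all samples)
S = rng.randn(ns, ds)              # semantic embeddings (support only)

def pair_diffs(E, idx_a, idx_b):
    """Relation vectors: pairwise differences between embeddings."""
    return np.stack([E[i] - E[j] for i in idx_a for j in idx_b])

sup = range(ns)
Rv_ss = pair_diffs(V, sup, sup)    # visual support-support relations
Rs_ss = pair_diffs(S, sup, sup)    # semantic support-support relations

# fit the transfer map on support-support pairs (MSE-optimal linear map)
T, *_ = np.linalg.lstsq(Rv_ss, Rs_ss, rcond=None)

# apply it to ALL visual relation vectors, including query pairs
Rv_all = pair_diffs(V, range(ns + nq), range(ns + nq))
R_hat = Rv_all @ T
# distances for the rectified adjacency: l2-norm of transferred vectors
d_hat = np.linalg.norm(R_hat, axis=1).reshape(ns + nq, ns + nq)
print(d_hat.shape)
```

The distance matrix `d_hat` would then be pushed through the Gaussian similarity to form the rectified adjacency; note its diagonal is zero by construction, since a sample's relation vector to itself is the zero vector.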
Graph Propagation. With the rectified adjacency matrix, both the visual graph and the semantic graph can be updated with graph propagation. Concretely, the matrix is first symmetrically normalized as

$$\tilde{A} = D^{-1/2} \hat{A} D^{-1/2},$$

where $D$ is the degree matrix of the graph. Then, we follow the label propagation operation as in  and get the propagation matrix as

$$P = (I - \alpha \tilde{A})^{-1},$$

where $\alpha$ is a smoothing factor and $I$ is the identity matrix. Finally, the visual embeddings can be rectified as

$$\hat{V} = P V,$$

and the completed semantic embeddings can be obtained as

$$\hat{S} = P S,$$

where the rows of $V$ and $S$ are the visual and semantic node embeddings, respectively.
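The propagation step can be sketched as follows, assuming a dense adjacency and the closed-form label-propagation solution (symmetric normalization followed by a matrix inverse); the function name and toy data are illustrative.

```python
import numpy as np

def propagate(A, E, alpha=0.2):
    """Graph propagation: symmetrically normalize the adjacency, build
    the closed-form propagation matrix, and propagate embeddings E."""
    deg = A.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(deg, 1e-12)))
    A_norm = D_inv_sqrt @ A @ D_inv_sqrt          # D^{-1/2} A D^{-1/2}
    P = np.linalg.inv(np.eye(len(A)) - alpha * A_norm)
    return P @ E

rng = np.random.RandomState(0)
A = rng.rand(6, 6)
A = (A + A.T) / 2                 # symmetric, nonnegative adjacency
np.fill_diagonal(A, 0.0)          # no self-loops
E = rng.randn(6, 4)               # node embeddings
E_hat = propagate(A, E)           # propagated (rectified) embeddings
print(E_hat.shape)
```

With a smoothing factor of 0, the propagation matrix reduces to the identity and the embeddings are returned unchanged; larger values mix in more information from neighboring nodes.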
With the guidance of the pseudo query semantic embeddings and the existing support semantic embeddings, the feature embeddings are modified to be more discriminative. Most importantly, the information asymmetries between the support set and the query set are reduced significantly.
III-D Feature Fusion and Classification
After the modal-alternating propagation module, the information asymmetries are reduced significantly as the visual feature embeddings and semantic embeddings are both available for every sample. Since the information of the two modalities is complementary, we fuse the embeddings by a convex combination to obtain more discriminative embeddings, which is denoted as:

$$z_i = \lambda_i \hat{v}_i + (1 - \lambda_i)\, \hat{s}_i,$$

where $\lambda_i$ is a coefficient learned with a weight learner:

$$\lambda_i = h_\psi([\hat{v}_i ; \hat{s}_i]),$$
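The fusion can be sketched as follows; a single sigmoid-squashed linear layer stands in for the weight-learner MLP, and the parameters `w`, `b` are illustrative stand-ins rather than the paper's trained weights.

```python
import numpy as np

def fuse(v, s, w, b):
    """Convex combination z = lam * v + (1 - lam) * s, where the
    coefficient lam is predicted from the concatenated embeddings
    by a (here linear, sigmoid-squashed) weight learner."""
    lam = 1.0 / (1.0 + np.exp(-(np.concatenate([v, s], axis=-1) @ w + b)))
    lam = lam[..., None]              # one coefficient per sample
    return lam * v + (1.0 - lam) * s

rng = np.random.RandomState(0)
v, s = rng.randn(10, 4), rng.randn(10, 4)   # visual / semantic embeddings
w, b = rng.randn(8), 0.0                    # stand-in learner parameters
z = fuse(v, s, w, b)
print(z.shape)
```

Because the sigmoid keeps the coefficient in (0, 1), every fused embedding lies elementwise between the visual and semantic embeddings it combines.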
where $h_\psi$ is an MLP and $[\cdot\,;\cdot]$ represents the concatenation operation. Then, following the operation of Prototypical Networks , we calculate the classification loss as follows:

$$\mathcal{L}_{cls} = -\log p(y = c \mid q), \qquad p(y = c \mid q) = \frac{\exp(-d(z_q, p_c))}{\sum_{c'} \exp(-d(z_q, p_{c'}))},$$

where $d(\cdot, \cdot)$ is the Euclidean distance, $p(y = c \mid q)$ is the probability that query sample $q$ belongs to class $c$, and $p_c$ is the prototype of the augmented features in class $c$.
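The prototype-based classification can be sketched as follows (a softmax over negative distances to per-class prototypes, in the style of Prototypical Networks; squared Euclidean distance is used here, and the toy data is synthetic).

```python
import numpy as np

def proto_probs(query, protos):
    """Class probabilities as a softmax over negative squared
    Euclidean distances to the class prototypes."""
    d2 = ((query[:, None, :] - protos[None, :, :]) ** 2).sum(-1)
    logits = -d2
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.RandomState(0)
protos = rng.randn(5, 4)                     # one prototype per class
query = protos[2] + 0.01 * rng.randn(3, 4)   # queries placed near class 2
p = proto_probs(query, protos)
print(p.argmax(axis=1))
```

Queries constructed near prototype 2 receive nearly all of their probability mass on class 2, illustrating how the loss pulls fused query embeddings toward their class prototype.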
Therefore, the total loss is

$$\mathcal{L} = \mathcal{L}_{cls} + \gamma \mathcal{L}_{rt},$$

where $\gamma$ is the weight coefficient of the relation transfer loss.
|Method|Semantics|Backbone|5-way 1-shot|5-way 5-shot|
|MatchingNet|N|ConvNet4|60.52 ± 0.88%|75.29 ± 0.75%|
|ProtoNet|N|ConvNet4|50.46 ± 0.88%|76.39 ± 0.64%|
|RelationNet|N|ConvNet4|62.34 ± 0.94%|77.84 ± 0.68%|
|MAML|N|ConvNet4|54.73 ± 0.97%|75.75 ± 0.75%|
|ARML|N|ConvNet4|62.33 ± 1.47%|73.34 ± 0.70%|
|AM3|Y|ConvNet4|73.78 ± 0.28%|81.39 ± 0.26%|
|AGAM|Y|ConvNet4|75.87 ± 0.29%|81.66 ± 0.25%|
|MAP-Net (Ours)|Y|ConvNet4|80.92 ± 0.21%|85.88 ± 0.17%|
|MatchingNet|N|ResNet12|60.96 ± 0.35%|77.31 ± 0.25%|
|RelationNet|N|ResNet12|60.21 ± 0.35%|80.18 ± 0.25%|
|MAML|N|ResNet18|69.96 ± 1.01%|82.70 ± 0.65%|
|FEAT|N|ResNet12|68.87 ± 0.22%|82.90 ± 0.15%|
|AFHN|N|ResNet18|70.53 ± 1.01%|83.95 ± 0.63%|
|Dual TriNet|Y|ResNet12|69.61 ± 0.46%|84.10 ± 0.35%|
|AGAM|Y|ResNet12|79.58 ± 0.23%|87.17 ± 0.23%|
|MAP-Net (Ours)|Y|ResNet12|82.45 ± 0.23%|88.30 ± 0.17%|
All accuracies are reported with 95% confidence intervals.
IV-A Experiment Setup
|Datasets|Training set|Validation set|Testing set|Semantics|
|CUB|100|50|50|Attributes|
|SUN Attribute|580|65|72|Attributes|
|Flowers 102|60|20|22|Text-descriptions|
Datasets. We conduct the experiments on three benchmark datasets with semantic information, i.e., attributes or text descriptions: Caltech-UCSD Birds-200-2011 (CUB) , SUN Attribute Database (SUN)  and Oxford 102 Flower (Flower) . Details of the datasets are shown in Table I. CUB is a fine-grained dataset of bird species that consists of 11,788 images from 200 categories, with 312 attributes per class. We follow the split in , which selects 100, 50 and 50 classes for training, validation and testing respectively. SUN contains 14,340 scene images of 717 classes with 102 attributes. Following , 580, 65 and 72 classes are employed for training, validation and testing. Flower is also a fine-grained dataset with 102 classes of flower species, where the number of images per class varies from 40 to 258. Each image has a 1024-dimensional text description. 60, 20 and 22 classes are used for training, validation and testing. All images in these datasets are resized to 84×84 for fair comparison.
|Method|Backbone|5-way 1-shot|5-way 5-shot|
|MatchingNet|ConvNet4|55.72 ± 0.40%|76.59 ± 0.21%|
|ProtoNet|ConvNet4|57.76 ± 0.29%|79.27 ± 0.19%|
|RelationNet|ConvNet4|49.58 ± 0.35%|76.21 ± 0.19%|
|AM3|ConvNet4|62.79 ± 0.32%|79.69 ± 0.23%|
|AGAM|ConvNet4|65.15 ± 0.31%|80.08 ± 0.21%|
|MAP-Net (Ours)|ConvNet4|67.73 ± 0.30%|80.30 ± 0.21%|
|Method|Backbone|5-way 1-shot|5-way 5-shot|
|AM3|ConvNet4|74.00 ± 0.29%|88.67 ± 0.18%|
|AGAM|ConvNet4|73.27 ± 0.29%|89.49 ± 0.17%|
|MAP-Net (Ours)|ConvNet4|77.41 ± 0.28%|90.63 ± 0.16%|
|VP|SP|RG|CUB 1-shot|CUB 5-shot|SUN 1-shot|SUN 5-shot|Flower 1-shot|Flower 5-shot|
||||75.30 ± 0.25%|80.28 ± 0.22%|63.11 ± 0.30%|78.91 ± 0.21%|74.78 ± 0.28%|88.81 ± 0.17%|
|✓|||78.14 ± 0.24%|81.42 ± 0.22%|59.12 ± 0.31%|76.51 ± 0.25%|74.97 ± 0.28%|89.76 ± 0.17%|
||✓||79.53 ± 0.23%|84.08 ± 0.19%|66.90 ± 0.29%|79.82 ± 0.23%|76.14 ± 0.28%|90.19 ± 0.17%|
|✓|✓||80.03 ± 0.23%|84.52 ± 0.19%|66.83 ± 0.29%|79.70 ± 0.23%|76.28 ± 0.28%|89.78 ± 0.17%|
||✓|✓|79.98 ± 0.22%|85.15 ± 0.19%|67.35 ± 0.30%|79.88 ± 0.24%|76.68 ± 0.29%|90.38 ± 0.16%|
|✓|✓|✓|80.92 ± 0.21%|85.88 ± 0.18%|67.73 ± 0.30%|80.30 ± 0.21%|77.41 ± 0.28%|90.63 ± 0.16%|
VP: visual graph propagation; SP: semantic graph propagation; RG: Relation Guidance. All results are 5-way accuracies with 95% confidence intervals.
Experimental Settings. We conduct experiments under the 5-way 1-shot and 5-way 5-shot settings, and 15 query samples are employed for both meta-training and meta-testing in each episode. The average accuracy (%) and the corresponding 95% confidence interval over 5000 test episodes are reported. For a fair comparison, our method is trained in the inductive setting; that is, we use only one query sample at a time to build the corresponding graph.
We employ ConvNet-4 and ResNet-12 as backbones to extract the visual features of images, while the semantic embeddings are encoded with a Multi-Layer Perceptron (MLP). In the meta-training stage, we train the model for 60 epochs, with 1000 training episodes and 600 validation episodes per epoch, in both the 1-shot and 5-shot settings. We randomly select K support samples and 15 query samples in each episode. The Adam optimizer is utilized with an initial learning rate of 0.001, which is multiplied by 0.1 every 15 epochs. The batch size is 5 for ConvNet-4 and 1 for ResNet-12. For MAP-Net, the smoothing factor is set to 0.2, and the weight coefficient of the relation transfer loss is 1 for CUB and 0.1 for SUN and Flower.
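The reported numbers can be reproduced from per-episode accuracies as follows; this sketch assumes a normal-approximation 95% confidence interval (1.96 × standard error), and the episode accuracies here are synthetic.

```python
import math
import random

# synthetic per-episode accuracies standing in for 5000 test episodes
rng = random.Random(0)
acc = [0.80 + rng.uniform(-0.1, 0.1) for _ in range(5000)]

mean = sum(acc) / len(acc)
var = sum((a - mean) ** 2 for a in acc) / (len(acc) - 1)  # sample variance
ci95 = 1.96 * math.sqrt(var / len(acc))                   # 1.96 * std error
print(f"{100 * mean:.2f} ± {100 * ci95:.2f}%")
```

This mirrors the "accuracy ± interval" format used in Tables II-VI.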
IV-B Comparison with State-of-the-art Methods
We choose fourteen popular meta-learning-based methods as competitors, including both nonsemantic-based and semantic-based methods:
(1) Nonsemantic-based methods, covering optimization-based methods (e.g., MAML , ARML ), metric-based methods (e.g., MatchingNet , ProtoNet , RelationNet , FEAT ) and data-augmentation-based methods (e.g., AFHN ).
(2) Semantic-based methods, covering methods that fuse or refine embeddings with semantics (e.g., AM3 , AGAM ) and methods that generate embeddings with semantics (e.g., Dual TriNet ).
We compare our MAP-Net with these methods using both ConvNet-4 and ResNet-12 on CUB. Since no previous method conducts experiments with ResNet-12 on SUN and Flower, we compare on these two datasets only with ConvNet-4. Tables II, III and IV show the results on the three benchmark datasets respectively.
It can be observed that our method outperforms all competitors on the three datasets with both the ConvNet-4 and ResNet-12 backbones. In particular, with the ConvNet-4 backbone, MAP-Net achieves performance gains over the second-best method of 5.05% and 4.22% on CUB, 2.58% and 0.22% on SUN, and 1.2% and 1.14% on Flower under the 1-shot and 5-shot settings respectively. With ResNet-12 on CUB, MAP-Net also obtains the best performance on both settings, outperforming the second-best approach AGAM  by 2.87% and 1.13% respectively.
Besides, we make the following three observations. (1) The improvements are more significant under the 1-shot setting than under the 5-shot setting, which indicates that auxiliary semantic information is more beneficial when visual samples are extremely scarce. (2) The improvements with ResNet-12 are smaller than those with ConvNet-4, which suggests that a stronger feature encoder suppresses the effectiveness of semantic information: when the encoder extracts more discriminative embeddings, the balance between the visual and semantic modalities is broken because the information carried by the semantic vectors is finite. (3) The performance improves only slightly under the 5-shot setting on SUN. We suppose the main reason is that the attribute vectors of SUN are so sparse that their effectiveness is suppressed as the visual information increases.
IV-C Ablation Studies
We conduct ablation studies to validate the effectiveness of the main components of MAP-Net on the three datasets; the results are shown in Table V. We take the model without these components as the baseline, which is similar to AM3 . The only difference is that AM3 combines the visual prototypes with the class semantic embeddings, while our baseline combines the support visual features with their semantic embeddings.
It is observed that our method with all components obtains the best performance on the three datasets. Both visual graph propagation and semantic graph propagation are beneficial in MAP-Net, especially the propagation in the semantic modality. For example, compared with the baseline, visual propagation brings 2.84% and 1.14% improvements on 1-shot and 5-shot CUB, while semantic propagation brings 4.23% and 3.8% improvements. For SUN and Flower, semantic propagation also improves the accuracy by a large margin. This demonstrates that the information asymmetries between support and query samples are reduced significantly by semantic propagation. However, visual propagation does not bring significant gains, and the performance is even damaged on SUN. We suppose the reason is that the information propagated from neighbor nodes in the visual graph might be useless or even harmful, since the relationships among samples are not quite accurate in low-data scenarios.
|Method|Backbone|5-way 1-shot|5-way 5-shot|
|MAP-Net (w/o guidance)|ConvNet4|80.03 ± 0.23%|84.52 ± 0.19%|
|MAP-Net (w/ IC)|ConvNet4|78.62 ± 0.24%|82.50 ± 0.21%|
|MAP-Net (w/ RC)|ConvNet4|79.13 ± 0.24%|82.65 ± 0.21%|
|MAP-Net (w/ RG)|ConvNet4|80.92 ± 0.21%|85.88 ± 0.17%|
For the same reason, the performance is also limited when the visual propagation and the semantic propagation are utilized simultaneously. By contrast, this issue is alleviated by the Relation Guidance strategy, which brings about a 1% improvement in all settings. This indicates that, with the guidance of the accurately distributed semantic embeddings, the rectified relation map represents the relationships among samples more correctly and propagates more discriminative pseudo-semantic embeddings. Moreover, the adverse impact of visual propagation is also reduced.
IV-D Further Analysis
Impact of Different Guidance Methods. To validate the effectiveness of the Relation Guidance strategy, we design alternative methods to guide the visual embeddings with semantic information. In this section, we compare the results on CUB with different guidance methods: Instance Constraint (IC), which directly constrains each cross-modal pair of support-sample representations; Relation Constraint (RC), which constrains the corresponding relation vectors of support-support pairs in the two modalities; and our proposed Relation Guidance (RG). The results are reported in Table VI.
We observe that both IC and RC hurt the performance, which indicates that directly constraining cross-modal embeddings is inappropriate in few-shot learning. By contrast, MAP-Net with RG achieves promising performance. This demonstrates that the information propagated among nodes is more accurate with the Relation Guidance strategy, and that relational information is more effective than instance-level constraints for alleviating the cross-modal discrepancy.
Analysis of Information Asymmetries. Figure 4 shows the accuracies of our method and two baseline methods (AM3 and ProtoNet) on the three datasets under 1- to 10-shot settings. It can be observed that the improvement of AM3 over ProtoNet shrinks as the number of shots increases; ProtoNet even surpasses AM3 under the 10-shot setting on the CUB and Flower datasets. By contrast, our method outperforms both baselines regardless of the number of shots. We interpret this result in two aspects. First, semantic information is more essential in scenarios with extremely few visual samples (e.g., 1-shot), while abundant visual information (e.g., 10-shot) can suppress its effectiveness. Second, more supervision information reflects more accurate relationships among different samples, which benefits the generation of the pseudo-semantic embeddings of query samples and thereby reduces the information asymmetries. Therefore, our method keeps improving as the number of shots increases.
Figure 5 shows the relationship between the weight coefficient and the number of shots. It is observed that the coefficient becomes larger as the number of shots increases, as analyzed in . Moreover, the coefficient for query samples is larger than that for support samples. We believe the reason is that the pseudo-semantic embeddings of query samples, propagated from adjacent support samples, cannot thoroughly represent the real semantic information, even though they supplement the missing information to some extent; this is worthy of further study in the future. We can also observe that, as the number of shots increases, the gap between the coefficients for support samples and query samples decreases. This means that the pseudo-semantics of query samples and the real semantics of support samples become more symmetric when abundant relationship information is constructed from more support samples, which also proves the effectiveness of modal-alternating propagation.
The differences among the three datasets reflect how representative the semantic information is of the corresponding samples. The semantic information used in CUB is more representative, since its weight on the semantic embeddings is large (almost 0.9), while the weights on SUN and Flower are relatively small.
Visualization Analysis. To demonstrate the effectiveness of our method, we visualize the embedding space on the CUB dataset with t-SNE. As observed in Fig. 6, the embeddings obtained from AM3 and MAP-Net are more separable than those from Prototypical Networks, thanks to the auxiliary semantic information. For ProtoNet, one prototype acquired by averaging support samples is even far from its corresponding query samples, which severely affects the classification results. Moreover, although the clusters of AM3 are relatively well separated, there is a large distance between the query embeddings and their prototypes, caused by the information asymmetry due to the lack of query semantic information; two prototypes even overlap with each other. By contrast, the query embeddings and prototypes of MAP-Net are more consistent, owing to the generated pseudo-semantic embeddings.
To address the information asymmetry problem that arises when utilizing auxiliary semantic information in few-shot learning, we have proposed a Modal-Alternating Propagation Network (MAP-Net) to compensate for the missing query semantics. MAP-Net generates query semantics and rectifies the feature embeddings by exploiting sample relationships to alternately propagate graphs in the two modalities. Furthermore, the proposed flexible Relation Guidance (RG) strategy significantly reduces the discrepancy between the two modalities: with the guidance of semantics, the relation vectors represent the true relationships among samples more accurately. Extensive experiments on three datasets with semantic information demonstrate the effectiveness of the proposed MAP-Net.
- (2020) Learning to forget for meta-learning. In CVPR, pp. 2379–2387.
- (2020) IMRAM: iterative matching with recurrent attention memory for cross-modal image-text retrieval. In CVPR, pp. 12655–12663.
- (2019) A closer look at few-shot classification. In ICLR, pp. 1–16.
- (2020) A new meta-baseline for few-shot learning. arXiv preprint arXiv:2003.04390.
- (2019) Multi-level semantic feature augmentation for one-shot learning. IEEE Transactions on Image Processing 28 (9), pp. 4594–4605.
- (2017) Model-agnostic meta-learning for fast adaptation of deep networks. In ICML, pp. 1126–1135.
- (2013) DeViSE: a deep visual-semantic embedding model. In NeurIPS, pp. 2121–2129.
- (2020) Zero-VAE-GAN: generating unseen features for generalized and transductive zero-shot learning. IEEE Transactions on Image Processing 29, pp. 3665–3680.
- (2018) Few-shot learning with graph neural networks. In ICLR, pp. 1–13.
- (2020) Zero and few shot learning with semantic feature synthesis and competitive learning. IEEE Transactions on Pattern Analysis and Machine Intelligence 43 (7), pp. 2510–2523.
- (2021) Meta-learning in neural networks: a survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–20.
- (2021) Attributes-guided and pure-visual attention alignment for few-shot recognition. In AAAI, pp. 7840–7847.
- (1987) On beyond zebra: the relation of linguistic and visual information. Cognition 26 (2), pp. 89–114.
- (2021) Few-shot human-object interaction recognition with semantic-guided attentive prototypes network. IEEE Transactions on Image Processing 30, pp. 1648–1661.
- (2020) Attribute-guided network for cross-modal zero-shot hashing. IEEE Transactions on Neural Networks and Learning Systems 31 (1), pp. 321–330.
- (2019) Edge-labeling graph neural network for few-shot learning. In CVPR, pp. 11–20.
- (2015) Adam: a method for stochastic optimization. In ICLR, pp. 1–15.
- (2017) Emotion recognition in context. In CVPR, pp. 1667–1675.
- (2020) Boosting few-shot learning with adaptive margin loss. In CVPR, pp. 12576–12584.
- (2020) Adversarial feature hallucination networks for few-shot learning. In CVPR, pp. 13470–13479.
- (2020) Attribute propagation network for graph zero-shot learning. In AAAI, pp. 4868–4875.
- (2019) Learning to propagate labels: transductive propagation network for few-shot learning. In ICLR, pp. 1–14.
- (2018) On first-order meta-learning algorithms. arXiv preprint arXiv:1803.02999.
- (2008) Automated flower classification over a large number of classes. In ICVGIP, pp. 722–729.
- (2018) TADAM: task dependent adaptive metric for improved few-shot learning. In NeurIPS, pp. 721–731.
- The SUN attribute database: beyond categories for deeper scene understanding. International Journal of Computer Vision 108 (1-2), pp. 59–81.
- (2020) Embedding propagation: smoother manifold for few-shot classification. In ECCV, pp. 121–138.
- (2019) Meta-learning with latent embedding optimization. In ICLR, pp. 1–17.
- Generalized zero- and few-shot learning via aligned variational autoencoders. In CVPR, pp. 8247–8255.
- (2019) Baby steps towards few-shot learning with multiple semantics. arXiv preprint arXiv:1906.01905.
- (2017) Prototypical networks for few-shot learning. In NeurIPS, pp. 4077–4087.
- (2018) Learning to compare: relation network for few-shot learning. In CVPR, pp. 1199–1208.
- (2019) Learning compositional representations for few-shot recognition. In ICCV, pp. 6372–6381.
- (2016) Matching networks for one-shot learning. In NeurIPS, pp. 3637–3645.
- (2011) The Caltech-UCSD Birds-200-2011 dataset. Technical Report CNS-TR-2011-001, California Institute of Technology.
- (1999) Learning to recognize objects. Trends in Cognitive Sciences 3 (1), pp. 22–31.
- (2020) Consensus-aware visual-semantic embedding for image-text matching. In ECCV, pp. 18–34.
- (2021) Learning efficient hash codes for fast graph-based data similarity retrieval. IEEE Transactions on Image Processing, pp. 1–14.
- (2021) How to trust unlabeled data: instance credibility inference for few-shot learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–14.
- (2018) Low-shot learning from imaginary data. In CVPR, pp. 7278–7286.
- (2019) Adaptive cross-modal few-shot learning. In NeurIPS, pp. 4847–4857.
- (2020) DPGN: distribution propagation graph network for few-shot learning. In CVPR, pp. 13390–13399.
- (2020) Automated relational meta-learning. In ICLR, pp. 1–19.
- (2020) Few-shot learning via embedding adaptation with set-to-set functions. In CVPR, pp. 8808–8817.
- (2019) Heterogeneous graph neural network. In ACM SIGKDD, pp. 793–803.
- (2021) Rethinking class relations: absolute-relative supervised and unsupervised few-shot learning. In CVPR, pp. 9432–9441.
- (2019) Few-shot learning via saliency-guided hallucination of samples. In CVPR, pp. 2770–2779.
- (2018) Link prediction based on graph neural networks. In NeurIPS, pp. 5165–5175.
- Spatial–temporal recurrent neural network for emotion recognition. IEEE Transactions on Cybernetics 49 (3), pp. 839–847.
- (2004) Learning with local and global consistency. In NeurIPS, pp. 321–328.
- (2020) Personalized image aesthetics assessment via meta-learning with bilevel gradient optimization. IEEE Transactions on Cybernetics, pp. 1–14.