Binocular Mutual Learning for Improving Few-shot Classification

08/27/2021 ∙ by Ziqi Zhou, et al. ∙ Megvii Technology Limited ∙ Dalian University of Technology

Most few-shot learning methods learn to transfer knowledge from datasets with abundant labeled data (i.e., the base set). From the perspective of class space on the base set, existing methods either focus on utilizing all classes under a global view by normal pretraining, or adopt an episodic manner to train meta-tasks within few classes in a local view. However, the interaction of the two views is rarely explored. As the two views capture complementary information, we naturally consider their compatibility for achieving further performance gains. Inspired by the mutual learning paradigm and binocular parallax, we propose a unified framework, namely Binocular Mutual Learning (BML), which achieves the compatibility of the global view and the local view through both intra-view and cross-view modeling. Concretely, the global view learns in the whole class space to capture rich inter-class relationships. Meanwhile, the local view learns in the local class space within each episode, focusing on matching positive pairs correctly. In addition, cross-view mutual interaction further promotes collaborative learning and the implicit exploration of useful knowledge from each other. During meta-test, binocular embeddings are aggregated together to support decision-making, which greatly improves classification accuracy. Extensive experiments conducted on multiple benchmarks, including cross-domain validation, confirm the effectiveness of our method.




1 Introduction

Conventional classification methods heavily rely on massive labeled data [russakovsky2015imagenet] with diverse visual variations. However, in many realistic scenarios, only limited labeled data is available [yao2019learning, mahajan2020meta], giving rise to the study of few-shot classification (FSC), where only a few labeled examples are given for learning new visual concepts. This setting makes FSC a challenging problem, since novel classes are unpredictable and the sampling of few shots is biased. To overcome these difficulties, many effective approaches have been proposed in recent years, which can be grouped into two categories according to training strategy. The first category is fine-tuning based paradigms [qiao2018few, lifchitz2019dense, chen2019closer, chen2020new, tian2020rethinking], which learn classifiers in the whole base class space with the straightforward purpose of maximizing differences between classes. Since all base classes are visible in each iteration, we refer to these methods as the global view. The other promising strategy is metric-based meta-training schemes [fei2006one, vinyals2016matching, snell2017prototypical, sung2018learning, oreshkin2018tadam, ye2020few, li2020boosting, guo2020attentive], which tune on only a few classes in each episode. The main idea comes from metric learning: match each unlabeled query to its correct class given a small labeled support set. Because the visible range of base classes in each meta-task is limited, we correspondingly call these methods the local view.


Method                               w/ global   w/ local   Strategy    Acc.

MAML [finn2017model]                     ✗           ✓      one-stage   63.1
MatchingNet [vinyals2016matching]        ✗           ✓      one-stage   55.3
RelationNet [sung2018learning]           ✗           ✓      one-stage   65.3
ProtoNet [snell2017prototypical]         ✗           ✓      one-stage   68.2

Rethink [tian2020rethinking]             ✓           ✗      one-stage   82.1
DC [lifchitz2019dense]                   ✓           ✗      one-stage   79.0
CloserLook [chen2019closer]              ✓           ✗      one-stage   75.7

DeepEMD [zhang2020deepemd]               ✓           ✓      two-stage   82.4
FEAT [ye2020few]                         ✓           ✓      two-stage   82.1
Meta-Baseline [chen2020new]              ✓           ✓      two-stage   79.3
Neg-Cosine [liu2020negative]             ✓           ✓      two-stage   81.6

Our BML                                  ✓           ✓      one-stage   83.6

Table 1: Comparison of BML with several representative methods from the view of base class space. Accuracy (Acc.) from miniImageNet [vinyals2016matching]. Obviously, the unified perspective of BML is more effective.

Considering that a single view (whether global or local) is relatively weak, it cannot provide adequate knowledge for accurate classification. Moreover, the combination of dual views fits well with the way people deepen their perception through two eyes. To this end, we propose the Binocular Mutual Learning (BML) paradigm, which equips the network with a global view and a local view simultaneously. The combination of two complementary views works like a binocular system, and the mutual interaction [zhang2018deep] between the two views further promotes their cooperation and calibrates the inappropriate expression caused by a single "biased" view. Concretely, BML generates better representations through both intra-view and cross-view modeling. The intra-view training captures view-specific knowledge, building two balanced feature spaces: one focuses on inter-class relationship perception (global view) and the other pays attention to matching details (local view). Meanwhile, the cross-view mutual interaction facilitates implicit knowledge transfer between the views. To balance the "binocular parallax", we increase the optimization difficulty of the local view, so that the global view can learn more useful knowledge from the mutual interaction (for details, see Section 3.2.2).

Clearly, the BML paradigm has two advantages, strong transferability and high time-efficiency, which are manifested in the following comparisons. Compared with the single-view methods mentioned above, BML has two complementary views, so more transferable and reliable knowledge can be learned. In contrast, purely global training lacks additional constraints, making it easy to over-fit on base patterns [goldblum2020unraveling]. Meanwhile, purely local training is restricted by local perspectives; its performance depends heavily on the configuration of tasks, not to mention the complex structures involved. Moreover, compared with two-stage methods, which first execute global training and then tune the embedding with local training, BML unifies the two views in a one-stage framework and enables them to promote each other. By contrast, two-stage methods [sun2019meta, chen2019closer, zhang2020deepemd, ye2020few, liu2020negative] are time-consuming. They focus on better learning the embedding in the first stage to provide stronger features for the second stage of optimization [liu2020negative], but ignore that the local view and the global view can promote each other in a unified manner.

We highlight the advantages of BML compared with several typical approaches in Table 1. As one of the first methods to consider the combination of dual views, we propose an elegant compatibility strategy, binocular mutual learning, inspired by the fact that human beings usually perceive the world through two eyes (benefiting from appropriate binocular parallax). The two complementary views simulate the binocular mode, and the mutual interaction calibrates deviation. Extensive experiments confirm the effectiveness of BML, mainly reflected in stable performance under evaluations of different granularity: whether facing a coarse-grained (e.g., miniImageNet [vinyals2016matching]) or fine-grained (e.g., CUB [WahCUB_200_2011]) situation, BML performs well, while single-view methods cannot handle all situations. This confirms that the unified framework does facilitate the mutual calibration of the two views. In summary, the contributions of this paper are as follows:


  • We closely analyze the status quo of FSC and propose an efficient one-stage Binocular Mutual Learning paradigm, BML, which elegantly aggregates the global view and the local view through both intra-view and cross-view modeling.

  • To enhance mutual learning, we propose an elastic loss to readjust the optimization difficulty of the local view, which promotes the bidirectional implicit knowledge transfer.

  • Extensive experiments on multiple benchmarks including cross-domain validation verify the effectiveness of our framework.

2 Related Works

2.1 Fine-Tuning based Methods (global view)

Research represented by [qiao2018few, sun2019meta, lifchitz2019dense, chen2019closer, tian2020rethinking, liu2020negative] pays more attention to global training, simply learning a class-specific embedding with fully-connected (FC) layers. Among them, FSIR [qiao2018few] adapted the pretrained model to new categories by directly predicting the parameters from activations, while DC [lifchitz2019dense] started from the perspective of spatial information mining and performed dense classification. MTL [sun2019meta] further enhanced the pretrained model through a hard-task meta-batch mechanism. [chen2019closer] proposed two different classification layers, a conventional FC layer (CloserLook) and an FC layer with feature normalization (CloserLook++). To alleviate over-fitting on base patterns, the authors of [tian2020rethinking] employed a self-distillation strategy and data augmentation to constrain the learning process.

Figure 1: The framework of BML. For each task, BML minimizes the global classification loss, the local matching loss and the distribution consistency loss to capture discriminative representations. During testing, the collaborative features from the two views are combined to support the final decision.

2.2 Meta-Training based Methods (local view)

One promising branch of meta-training is metric-based methods [vinyals2016matching, snell2017prototypical, sung2018learning, ye2020few, oreshkin2018tadam, li2020boosting, li2020adversarial, li2019finding, yang2020dpgn, hou2019cross, guo2020attentive, boudiaf2020transductive, qiao2019transductive], which meta-learn an ideal metric space that brings homogeneous samples closer while pushing heterogeneous samples away. The classification process is carried out under the guidance of the "nearest neighbor" principle. Specifically, Matching Networks [vinyals2016matching] proposed an LSTM [hochreiter1997long]-based encoding module to re-encode the context of support features, and computed attention scores using cosine distance. Similar variants include FEAT [ye2020few], which replaced the LSTM with a transformer [vaswani2017attention]. Prototypical Networks [snell2017prototypical] employed a prototype to identify each class and calculated Euclidean distances. Relation Networks [sung2018learning] further adopted a learnable correlation module to measure pairwise similarity. Subsequent works derived many variants by incorporating cross-modal information [li2020boosting], introducing adversarial noise [li2020adversarial], employing local descriptors [li2019revisiting] or mutual information [guo2020attentive, boudiaf2020transductive], constructing pretext tasks [zhang2021iept], using attention mechanisms [fei2021melr, hou2019cross], or learning task-relevant metrics [oreshkin2018tadam, li2019finding].

Besides, some studies started from optimization [finn2017model, li2017meta, jamal2019task, raghu2019rapid, lee2019meta]. A typical work is MAML [finn2017model], in which the parameters of a base learner are further optimized within a few iterations for quick adaptation to new tasks. Subsequent variants developed MAML further by designing additional objectives [li2017meta], employing better classifiers [lee2019meta] or dynamically adjusting the weights of different tasks [jamal2019task]. More interestingly, the working mechanism of MAML was analyzed in detail in [raghu2019rapid], whose experiments highlighted that "feature reuse" is the key to its success. In other words, performance is largely determined by the quality of the learned features, which is revalidated by the experiments conducted in this paper.

We also found that some recent studies [liu2020negative, chen2020new, zhang2020deepemd, ye2020few] have recognized the mutually beneficial roles of the two views. Some of them [liu2020negative, chen2020new, zhang2020deepemd, ye2020few] employ a two-stage training scheme: they first pretrain under the global mode and then tune the parameters using the local mode. Others [chen2020diversity, hou2019cross, oreshkin2018tadam] promote local training by learning additional global classifiers. All of them focus on the promotion from global to local but ignore the bidirectional cross-view interaction, which plays an important role in BML.

2.3 Mutual Learning

Mutual learning [zhang2018deep] is a distillation paradigm that has recently shined in many fields; it breaks the conventional "teacher-student" structure with its fixed direction of supervision. In mutual learning, a group of students learn collaboratively from each other, which helps obtain more general models without pretrained teachers. Similar ideas have been employed in person re-ID [ge2020mutual], but not effectively applied to FSC. Works such as [tian2020rethinking, li2020few] are close to this topic, but their training processes are two-staged with fixed pretrained teachers.

Inspired by mutual learning [zhang2018deep] and binocular parallax, we propose the unified BML framework with two complementary views, each of which can be regarded as a "student". During training, besides their offline hard supervision, the two "students" also learn collaboratively and implicitly explore useful knowledge from each other.

3 Methodology

3.1 Preliminary

In the standard FSC scenario, we are given a base set $\mathcal{D}_{base}$ with $\mathcal{C}_{base}$ classes and a novel set $\mathcal{D}_{novel}$ with $\mathcal{C}_{novel}$ classes, where $\mathcal{C}_{base} \cap \mathcal{C}_{novel} = \varnothing$. Training is usually performed on the base classes, and the optimization goal is to transfer the learned knowledge to new tasks built on $\mathcal{D}_{novel}$. During meta-test, a family of tasks is constructed for evaluation. Concretely, each task $T$ has a support set $S$ and a query set $Q$. The support set contains $N$ classes and each class has $K$ images (i.e., the $N$-way $K$-shot setting). The query set includes unlabeled images from the same $N$ classes. In most literature, $N$ is set to 5 and $K$ is set to 1 or 5; we follow the same convention.
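The episodic protocol above can be sketched in a few lines. The sampler below is an illustrative stand-in, not the paper's code; the toy dataset structure (class name mapped to a list of examples) is ours.

```python
import random

def sample_episode(dataset, n_way=5, k_shot=1, q_queries=15, seed=None):
    """Sample one N-way K-shot task: a support set and a query set.

    `dataset` maps class name -> list of examples (illustrative structure).
    """
    rng = random.Random(seed)
    classes = rng.sample(sorted(dataset), n_way)
    support, query = [], []
    for label, cls in enumerate(classes):
        # draw K support and Q query examples per class, without overlap
        examples = rng.sample(dataset[cls], k_shot + q_queries)
        support += [(x, label) for x in examples[:k_shot]]
        query += [(x, label) for x in examples[k_shot:]]
    return support, query

# Toy dataset: 10 classes with 20 examples each
toy = {f"class_{c}": [f"img_{c}_{i}" for i in range(20)] for c in range(10)}
support, query = sample_episode(toy, n_way=5, k_shot=1, q_queries=15, seed=0)
```

For a 5-way 1-shot task with 15 queries per class, this yields 5 support and 75 query examples per episode.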

3.2 Binocular Mutual Learning

As shown in Figure 1, BML has two complementary branches (views) built on shared blocks. The global branch learns in the whole class space for inter-class relationship mining, while the local branch learns within the few classes of each episode, aiming to match each query sample to its support prototype. Besides, the two branches implicitly explore useful knowledge from each other by minimizing a KL-divergence based mimicry loss to match the feature distribution of their peer.

3.2.1 Global Intra-view Training

For the global branch, the learned features are related to the whole class space, which explicitly contain rich inter-class relationships.

Specifically, given task $T$, we learn a global learner $f_g$ to map each image in $T$ to a high-dimensional feature space, and then a convolution layer with weights $W_{cls}$ is learned to classify each spatial point of the feature map into its corresponding class. Let $W$ denote the width of the feature map and $H$ its height; the probability estimate at point $(i, j)$ is formulated as:

$$p_{(i,j)}(y \mid x) = \sigma\big(W_{cls} \cdot f_g(x)_{(i,j)}\big), \quad f_g(x)_{(i,j)} \in \mathbb{R}^{d},$$

where $\sigma$ is the softmax function and $d$ is the feature dimension, which is 640 here.

The negative logarithm of $p_{(i,j)}(y \mid x)$ gives the loss of the current feature point of input $x$. The total global loss is the average over all points of all images in $T$:

$$\mathcal{L}_{global} = -\frac{1}{|T| \cdot HW} \sum_{x \in T} \sum_{i=1}^{H} \sum_{j=1}^{W} \log p_{(i,j)}(y_x \mid x).$$

Meaningfully, classifying each feature point correctly helps capture the spatial structure of the foreground. Since each point can be traced back to a local area of the input, multiple points represent multiple local areas, which is equivalent to a memory-saving multi-crop operation [lifchitz2019dense]. For fair comparison, we also apply point-wise classification to the baselines compared in this paper.
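Point-wise classification can be illustrated with a small pure-Python sketch; shapes and numbers are illustrative, and a real implementation would apply a convolutional classifier over the whole batch.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def dense_classification_loss(feature_map, weights, label):
    """Average cross-entropy over every spatial point of one feature map.

    feature_map: H x W x d nested lists; weights: num_classes x d.
    Each (i, j) point is classified independently, which mimics a
    memory-saving multi-crop over local foreground areas.
    """
    loss, count = 0.0, 0
    for row in feature_map:
        for f in row:
            logits = [sum(w_k * f_k for w_k, f_k in zip(w, f)) for w in weights]
            p = softmax(logits)
            loss += -math.log(p[label])
            count += 1
    return loss / count

# 2x2 feature map with d = 2; three classes (illustrative numbers)
fmap = [[[1.0, 0.0], [1.0, 0.1]], [[0.9, 0.0], [1.0, 0.0]]]
weights = [[2.0, 0.0], [0.0, 2.0], [-2.0, 0.0]]
loss_correct = dense_classification_loss(fmap, weights, label=0)
loss_wrong = dense_classification_loss(fmap, weights, label=2)
```

Features aligned with their class weight vector yield a much smaller per-point loss than misaligned ones, which is what drives each spatial point toward its class.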

3.2.2 Local Intra-view Training

The local branch borrows the idea of metric learning and learns to match each query sample with support prototypes in the embedding space.

For task $T$ with $N$ classes, we divide it into a support set $S$ and a query set $Q$. A local learner $f_l$ is learned to map all samples to an embedding space, and the matching process is guided by the nearest-neighbor strategy, where $N$ prototypes are calculated, i.e., $c_k = \frac{1}{K} \sum_{(x_i, y_i) \in S_k} f_l(x_i)$. To measure the similarity between a query $q$ and each prototype, we simply employ the Euclidean distance, denoted $d(\cdot, \cdot)$. The negative log-probabilities of all query samples are averaged to obtain the local loss:

$$\mathcal{L}_{local} = -\frac{1}{|Q|} \sum_{q \in Q} \log \frac{\exp\big(-d(f_l(q), c_{y_q})\big)}{\sum_{k=1}^{N} \exp\big(-d(f_l(q), c_k)\big)}.$$
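The prototype computation and nearest-prototype matching can be sketched as follows; this is a minimal pure-Python illustration of the standard prototype/nearest-neighbor recipe, not the paper's code.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def prototypes(support):
    """Mean embedding per class; support: list of (embedding, label) pairs."""
    sums, counts = {}, {}
    for emb, lbl in support:
        acc = sums.setdefault(lbl, [0.0] * len(emb))
        sums[lbl] = [a + e for a, e in zip(acc, emb)]
        counts[lbl] = counts.get(lbl, 0) + 1
    return {lbl: [v / counts[lbl] for v in vec] for lbl, vec in sums.items()}

def match(query_emb, protos):
    """Nearest-prototype label under Euclidean distance."""
    return min(protos, key=lambda lbl: euclidean(query_emb, protos[lbl]))

# Toy 2-way 2-shot support set with 2-D embeddings
support = [([0.0, 0.0], 0), ([2.0, 2.0], 0), ([10.0, 10.0], 1), ([12.0, 10.0], 1)]
protos = prototypes(support)
```

A query embedded near a class's samples is matched to that class's prototype, which is exactly the inference rule the local loss optimizes.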
Elastic constraint for magnifying optimization difficulty. Moreover, we consider the following two points and further optimize the local loss by applying an elastic constraint.

On the one hand, each task $T$ is a collection of randomly sampled data, resulting in randomness of difficulty. Treating all tasks equally during cross-epoch training, as in [snell2017prototypical], leads to an "unhealthy" phenomenon: further performance improvement is hard to obtain late in training, because the model is dynamically growing while the task difficulty is static [bengio2009curriculum]. On the other hand, the optimization of the local branch is relatively easy compared with the global one (since each query has only $N-1$ negative prototypes, while the global branch faces $\mathcal{C}_{base}-1$ negative classes). Mutual promotion requires a seedbed, that is, the two views must each provide valuable knowledge to the other.

Figure 2: The process of elastic loss. All queries are pushed away from their clusters to magnify the optimization difficulty, and the network is forced to pull them back again.

Therefore, we propose the elastic loss to increase the training difficulty of the local branch. The complete process is shown in Figure 2. Simply put, we modify the position of each query sample in the embedding space according to the difference between its distance to the positive prototype and its distance to the nearest negative prototype. This operation can be described as "push away". We then demand that the network pull these pushed-out samples back to the vicinity of their positive prototypes, so as to learn more implicit knowledge and obtain further performance improvement.

To better perceive the foreground from different patches and learn the rich relationship between each prototype and patch-specific query features, similar to the global branch, we calculate the elastic constraint at each spatial position instead of performing global average pooling. Forcing the network to pull the query back to its positive prototype at different (i, j) positions helps mine more hard samples and avoids over-fitting on simple local tasks.

The updated local loss enlarges the distance between each query patch and its positive prototype by the elastic constraint $\epsilon_{(i,j)}$:

$$\mathcal{L}_{local}^{EL} = -\frac{1}{|Q| \cdot HW} \sum_{q \in Q} \sum_{i,j} \log \frac{\exp\big(-(d(f_l(q)_{(i,j)}, c_{y_q}) + \epsilon_{(i,j)})\big)}{\exp\big(-(d(f_l(q)_{(i,j)}, c_{y_q}) + \epsilon_{(i,j)})\big) + \sum_{k \neq y_q} \exp\big(-d(f_l(q)_{(i,j)}, c_k)\big)},$$

where $\epsilon_{(i,j)}$ is the patch-specific elastic constraint, detailed in Algorithm 1. Two scale factors adjust it: one controls the push degree across epochs, and the other controls the push degree across tasks.

Input: prototypes $\{c_k\}$; query $q$; current epoch $e$; total epochs $E$
Output: elastic constraint $\epsilon$

1: Get the logits of $q$ using Eq. 3;
2: Compute the distance between $q$ and its positive prototype $c_p$: $d_p$;
3: Compute the distance between $q$ and its nearest negative prototype $c_n$: $d_n$;
4: Compute $\epsilon$ from $d_p$, $d_n$ and the two scale factors;
5: Return $\epsilon$;
Algorithm 1: Elastic constraint on $q$.
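As a rough illustration of the "push away" idea (not the paper's exact constraint, whose formula depends on the two scale factors and the epoch schedule), one can displace a query along the direction away from its positive prototype by an amount that grows with the positive/negative distance gap:

```python
import math

def elastic_push(query, pos_proto, neg_proto, gamma=5.0, mu=0.1):
    """Toy 'push away' step (illustrative only; the paper computes a
    patch-specific constraint from d_p, d_n and schedule factors)."""
    d_pos = math.dist(query, pos_proto)
    d_neg = math.dist(query, neg_proto)
    # push harder when the task is easy (query much closer to its own prototype)
    strength = mu * gamma * max(d_neg - d_pos, 0.0)
    direction = [(q - p) / (d_pos + 1e-8) for q, p in zip(query, pos_proto)]
    return [q + strength * d for q, d in zip(query, direction)]

# Query close to its positive prototype, far from the negative one
pushed = elastic_push([1.0, 0.0], pos_proto=[0.0, 0.0], neg_proto=[5.0, 0.0])
```

The network is then asked to pull the displaced sample back, which is the curriculum effect the elastic loss is after: easy episodes are artificially hardened.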

3.2.3 Cross-view Mutual Learning

In addition to separate intra-view learning, the two views promote each other through cross-view mutual interaction. Specifically, each view, besides completing its own offline hard task, is also forced to minimize a KL-divergence based mimicry loss with respect to the other view, which encourages implicit knowledge transfer. The mutual loss has two sub-items:

$$\mathcal{L}_{mutual} = \mathrm{KL}\big(p_l \,\|\, p_g\big) + \mathrm{KL}\big(p_g \,\|\, p_l\big),$$

where $p_g$ and $p_l$ represent the feature distributions calculated by the global and local views, respectively. We consider the interaction from the perspective of feature distribution consistency, and only align relative relationships rather than imposing hard constraints such as mean squared error in Euclidean space, because too strong a supervision signal is harmful to retaining the specificity of the two views.

In summary, the final loss is:

$$\mathcal{L} = \alpha \mathcal{L}_{global} + \beta \mathcal{L}_{local}^{EL} + \lambda \mathcal{L}_{mutual},$$

where $\alpha$, $\beta$ and $\lambda$ are weighting factors.
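The symmetric KL term can be sketched as follows; this is a minimal stand-in for the mimicry loss, and the distribution names are ours.

```python
import math

def kl(p, q, eps=1e-12):
    """KL divergence between two discrete distributions (lists of probs)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def mutual_loss(p_global, p_local):
    """Symmetric KL-based mimicry loss between the two views."""
    return kl(p_global, p_local) + kl(p_local, p_global)
```

The loss is zero when the two views already agree and grows as their predicted distributions diverge, so each view is nudged toward the other's relative class ordering without a hard Euclidean constraint.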

3.2.4 Inference

During meta-test, we integrate the results of the global and local branches. We do not perform global average pooling but directly flatten the features and calculate the Euclidean distance (the same as ProtoNet [snell2017prototypical]). The integration can be done at the feature level or at the logits level; in this paper, we simply integrate the global and local logits.
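Assuming logit-level integration means element-wise addition (the text only says the logits are integrated), inference can be sketched as:

```python
def binocular_predict(global_logits, local_logits):
    """Add per-class logits from the two views and take the arg-max."""
    combined = [g + l for g, l in zip(global_logits, local_logits)]
    return combined.index(max(combined))

# The global view slightly prefers class 1; the local view agrees on balance.
pred = binocular_predict([1.0, 3.0, 0.0], [2.0, 1.5, 1.0])
```

A query misranked by one view can still be classified correctly when the other view's logits dominate, which is the point of aggregating the binocular embeddings.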

4 Experiments

In this section, we answer the following questions:


  • How does our BML perform compared with SoTAs?

  • Why is binocular learning better?

  • How does the elastic loss work?

  • Is BML less sensitive to image quality?

  • Should the standards for validation be diverse?

4.1 Meta Datasets

We validate BML on four commonly used benchmarks: miniImageNet [vinyals2016matching], tieredImageNet [ren2018meta], CIFAR-FS [bertinetto2018meta] and CUB-200-2011 (CUB) [WahCUB_200_2011]. Details of these datasets are summarized in Table 2. All input images are resized to the same resolution during comparison, and in the ablation part we further analyze the stability of BML when enlarging the image size or introducing other degradation. In particular, for CUB, we crop the test images with the bounding boxes provided by [triantafillou2017few] for a fair comparison.


Dataset                              Images    Classes   Split (train/val/test)

miniImageNet [vinyals2016matching]   60,000    100       64/16/20
tieredImageNet [ren2018meta]         779,165   608       351/97/160
CIFAR-FS [bertinetto2018meta]        60,000    100       60/16/20
CUB-200-2011 [WahCUB_200_2011]       11,788    200       100/50/50


Table 2: Summary of the four benchmarks.

4.2 Implementation Details

Architecture. Following previous works [oreshkin2018tadam, tian2020rethinking, lee2019meta], we use ResNet12 as our backbone, which consists of 4 residual blocks. Each block has 3 convolutional layers with 3×3 kernels and a 2×2 max-pooling layer. We remove the last global average-pooling layer to preserve spatial information. Similar to [lee2019meta], we use DropBlock as a regularizer, and the numbers of filters are set to (64, 160, 320, 640). Since BML has a binocular structure, we share the first three blocks and assign an independent block-4 to each view.


Method Backbone | miniImageNet (1-shot / 5-shot) | tieredImageNet (1-shot / 5-shot) | CIFAR-FS (1-shot / 5-shot)

MAML [finn2017model] ConvNet | 48.70±1.84 / 63.11±0.92 | 51.67±1.81 / 70.30±1.75 | 58.90±1.90 / 71.50±1.00
TAML [jamal2019task] ConvNet | 51.77±1.86 / 65.60±0.93 | - / - | - / -
MetaOptNet [lee2019meta] ResNet12 | 64.09±0.62 / 80.00±0.45 | 65.81±0.74 / 81.75±0.53 | 72.00±0.70 / 84.20±0.50

ProtoNet [snell2017prototypical] ConvNet | 49.42±0.78 / 68.20±0.66 | 53.31±0.89 / 72.69±0.74 | 55.50±0.70 / 72.00±0.60
MatchingNet [vinyals2016matching] ConvNet | 43.56±0.84 / 55.31±0.73 | - / - | - / -
RelationNet [sung2018learning] ConvNet | 50.44±0.82 / 65.32±0.70 | 54.48±0.93 / 71.32±0.78 | 55.00±1.00 / 69.30±0.80
DeepEMD [zhang2020deepemd] ResNet12 | 65.91±0.82 / 82.41±0.56 | 71.16±0.80 / 86.03±0.58 | 46.47±0.70 / 63.22±0.71
FEAT [ye2020few] ResNet12 | 66.78±0.20 / 82.05±0.14 | 70.80±0.23 / 84.79±0.16 | - / -
TADAM [oreshkin2018tadam] ResNet12 | 58.50±0.30 / 76.70±0.30 | - / - | - / -
CTM [li2019finding] ResNet18 | 64.12±0.82 / 80.51±0.13 | 68.41±0.39 / 84.28±1.73 | - / -
LR+ICI [wang2020instance] ResNet12 | 66.80±n/a / 79.26±n/a | 80.79±n/a / 87.92±n/a | 73.97±n/a / 84.13±n/a

Rethink-Distill [tian2020rethinking] ResNet12 | 64.82±0.60 / 82.14±0.43 | 71.52±0.69 / 86.03±0.49 | 73.90±0.80 / 86.90±0.50
DC [lifchitz2019dense] ResNet12 | 61.26±0.20 / 79.01±0.13 | - / - | - / -
MTL [sun2019meta] ResNet12 | 61.20±1.80 / 75.50±0.80 | - / - | - / -
CloserLook++ [chen2019closer] ResNet18 | 51.87±0.77 / 75.68±0.63 | - / - | - / -
Meta-Baseline [chen2020new] ResNet12 | 63.17±0.23 / 79.26±0.17 | 68.62±0.27 / 83.29±0.18 | - / -
Neg-Cosine [liu2020negative] ResNet12 | 63.85±0.81 / 81.57±0.56 | - / - | - / -
AFHN [li2020adversarial] ResNet18 | 62.38±0.72 / 78.16±0.56 | - / - | 68.32±0.93 / 81.45±0.87
Centroid [afrasiyabi2020associative] ResNet18 | 59.88±0.67 / 80.35±0.73 | 69.29±0.56 / 85.97±0.49 | - / -

Baseline-local ResNet12 | 58.96±0.45 / 77.07±0.34 | 64.46±0.51 / 82.21±0.36 | 67.60±0.49 / 84.78±0.34
Baseline-global ResNet12 | 61.71±0.48 / 81.21±0.32 | 63.27±0.52 / 82.22±0.36 | 69.74±0.49 / 87.37±0.34
BML ResNet12 | 67.04±0.63 / 83.63±0.29 | 68.99±0.50 / 85.49±0.34 | 73.45±0.47 / 88.04±0.33

Table 3: Comparison on miniImageNet, tieredImageNet and CIFAR-FS. Some baseline results are reported in [lee2019meta].

Optimization setup. We use an SGD optimizer with a momentum of 0.9 and a weight decay of 5e-4. The learning rate is initialized to 0.1. For miniImageNet, CIFAR-FS and CUB, we train 100 epochs; at epoch 50, the learning rate is reduced to 6e-3, and further reduced to 1.2e-4 at epoch 70. For tieredImageNet, we train 150 epochs and decay the learning rate by 0.1× every 40 epochs. In particular, when organizing data, in order to meet the binocular requirements, we adopt a uniform sampling strategy. The ratio of the three loss weights is set to 4:2:1, and the two scale factors in the elastic loss are experimentally set to 5.5 and 0.1, respectively. To ensure stable evaluation, for each benchmark we test 2,000 episodes and report the average performance.
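The step schedule described above can be written as a small helper; the function name and structure are ours, the rates and drop points come from the text.

```python
def learning_rate(epoch, dataset="miniImageNet"):
    """Step learning-rate schedule from the optimization setup (helper is ours)."""
    if dataset == "tieredImageNet":
        # 150 epochs, decayed by 0.1x every 40 epochs
        return 0.1 * (0.1 ** (epoch // 40))
    # miniImageNet / CIFAR-FS / CUB: 100 epochs, drops at epochs 50 and 70
    if epoch < 50:
        return 0.1
    if epoch < 70:
        return 6e-3
    return 1.2e-4
```

Such a coarse step schedule keeps the early training aggressive and only anneals once the embedding has largely stabilized.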

4.3 Experimental Results

4.3.1 Comparison on Coarse-grained Benchmark

Comparisons with prior works are shown in Table 3; our results are highlighted in bold with a gray background. Specific structural details of ConvNet (filter numbers of the four blocks) are as follows: MAML [finn2017model]: 32-32-32-32; TAML [jamal2019task], ProtoNet [snell2017prototypical], MatchingNet [vinyals2016matching]: 64-64-64-64; RelationNet [sung2018learning]: 64-96-128-256.

As reported, compared with the two baselines, the performance of BML is remarkable, up to 9% higher in some cases (details are analyzed in Section 4.3.3). Compared with other competitors, BML achieves a new SoTA on miniImageNet, outperforming even the best metric-based method, FEAT [ye2020few]. On tieredImageNet, we also achieve good performance by simply using the nearest-neighbor principle. As for CIFAR-FS, we surpass all competitors and reach a new SoTA, including LR+ICI [wang2020instance], which is based on a transductive strategy.

4.3.2 Comparison on Fine-grained Benchmark

Within-domain evaluation results are shown in Table 4 (some baseline results are reported in [chen2019closer]). BML outperforms the runner-up DeepEMD [zhang2020deepemd] by 1.76% and 0.56% under the 5-shot and 1-shot settings, respectively. Although DeepEMD [zhang2020deepemd] adopts a similar idea of spatial information mining, its task-dependent patch-wise matching is time-consuming, while BML shifts attention to the bottom embedding, which is time-efficient and effective.


Method Backbone | 5-way 1-shot | 5-way 5-shot

MAML [finn2017model] ResNet18 | 68.42±1.07 | 83.47±0.62

ProtoNet [snell2017prototypical] ResNet18 | 72.99±0.88 | 86.64±0.51
MatchingNet [vinyals2016matching] ResNet18 | 73.49±0.89 | 84.45±0.58
RelationNet [sung2018learning] ResNet18 | 68.58±0.94 | 84.05±0.56
DeepEMD [zhang2020deepemd] ResNet12 | 75.65±0.83 | 88.69±0.50

CloserLook [chen2019closer] ResNet18 | 47.12±0.74 | 64.16±0.71
CloserLook++ [chen2019closer] ResNet18 | 60.53±0.83 | 79.34±0.61
Centroid [afrasiyabi2020associative] ResNet18 | 74.22±1.09 | 88.65±0.55

Baseline-local ResNet12 | 66.79±0.49 | 86.55±0.28
Baseline-global ResNet12 | 60.13±0.49 | 79.77±0.36
BML ResNet12 | 76.21±0.63 | 90.45±0.36

Table 4: Within-domain comparison on CUB-200-2011.
Figure 3: Effect of binocular learning on multiple benchmarks ((a) 1-shot, (b) 5-shot, (c) 1-shot, (d) 5-shot). The comparison results prove the superiority of BML.

Moreover, comparing the performance of baseline-global reported in Tables 3-4 and Figure 3, we find that baseline-global degrades significantly on the fine-grained benchmark CUB, which shows that single global training loses its effect when inter-class differences are relatively small. BML, however, performs well at all granularities, which proves that the binocular framework is more robust to granularity changes.


Method | miniImageNet → CUB

MatchingNet [vinyals2016matching] | 53.07±0.74
ProtoNet [snell2017prototypical] | 62.02±0.70
MAML [finn2017model] | 51.34±0.72
RelationNet [sung2018learning] | 57.71±0.73
CloserLook [chen2019closer] | 65.57±0.70
CloserLook++ [chen2019closer] | 62.04±0.76
Centroid [afrasiyabi2020associative] | 70.37±1.02
Rethink-distill [tian2020rethinking] | 68.57±0.39
BML | 72.42±0.54

Table 5: Cross-domain comparison on CUB-200-2011.

Cross-domain evaluation results are reported in Table 5 (some baseline results are reported in [afrasiyabi2020associative]); BML outperforms all compared methods by a large margin. Among them, the method of [tian2020rethinking], which also employs online soft labels to improve the learning of inter-class relations, is 3.85% lower than our binocular mutual learning mechanism. This shows that bidirectional cross-view interaction is better than unidirectional same-view interaction.

4.3.3 Ablation Study

Analysis of Binocular Learning: Our baselines are two typical single-view methods: baseline-global and a stronger ProtoNet variant, baseline-local. All experimental configurations of the two are the same as for BML. According to the results shown in Figure 3, we highlight three observations:

1) Binocular mutual training effectively integrates the advantages of the two views and obtains complementary integrated features. For instance, under the 1-shot and 5-shot settings shown in subfigures 3(a) and 3(b), BML is significantly better than the two single-view training methods on both coarse-grained and fine-grained benchmarks.

2) Binocular mutual training makes the two views promote each other. As shown in subfigures 3(c) and 3(d), BML-local and BML-global, trained under the binocular scheme, are better than baseline-local and baseline-global trained in single-view mode: BML-local is 1.4%-9.7% higher than baseline-local, and BML-global is 1.8%-13.3% higher than baseline-global. Since the only difference is whether training is binocular, the results confirm the effectiveness of binocular training.

3) Binocular mutual training is less sensitive to shot count. By comparison, we find that BML-local and BML-global show different advantages under different settings. The global branch performs better when the shot count is greater than 1, indicating that the global view captures richer expressions; under the 1-shot setting, the local branch shows a clearer advantage, indicating that the local view is robust against sampling uncertainty, because it minimizes the variation of features within a class relative to the variation between classes [goldblum2020unraveling]. Binocular aggregation avoids the instability of a single view and absorbs the advantages of both branches.

Analysis of Mutual Interaction: Applying the mutual interaction loss to the binocular framework further improves performance by 0.79%:

BML w/o mutual loss: 82.84 | w/ mutual loss: 83.63 (+0.79)

Besides, we separately analyze the impact of the mutual loss on a single view. Taking the local branch as an example, the performance with and without it is as follows:

BML-Local w/o mutual loss: 79.29 | w/ mutual loss: 80.95 (+1.66)
Figure 4: t-SNE results on miniImageNet base classes.

Together with the qualitative results in Figure 4, we find that, compared with the left plot (without the mimicry loss), after introducing it (right), the cluster structure on the base classes is slightly broken but the performance on novel classes is significantly improved. This shows that, beyond the single hard label, minimizing the mimicry loss helps alleviate over-fitting while improving transferability.

Analysis of Elastic Loss (ELloss): We simply augment the basic Euclidean-distance matching with the elastic loss and keep the other settings unchanged. The comparison result (all experiments conducted on miniImageNet) is:

baseline-local w/o ELloss: 77.07 | w/ ELloss: 77.77 (+0.70)

The introduction of ELloss brings a clear performance improvement. We next provide a more detailed analysis.

Effect of the two scale factors: The two factors in Algorithm 1 control the push degree. The first lies in [4, 6]; this range is derived from observing normal training, where the query-prototype distance stably falls in [6, 12] after convergence, so we enlarge it by about 50%. The second lies in [0.05, 0.25] and controls the push degree across tasks. As shown in Figure 5, simply employing ELloss on baseline-local obtains a significant improvement (best configuration: 5.5 and 0.1), which shows that, compared with treating all tasks equally, redefining task difficulty helps dig out more transferable knowledge.

Figure 5: Effect of the two push factors (trained for 100 epochs).

Effect of Elastic Loss under different settings: We further analyze ELloss in Figure 6. The results show that the local mode is indeed a relatively simple problem, since matching only occurs within the current episode, which naturally makes the network too lazy to find satisfactory solutions. Introducing ELloss delays the appearance of the performance saturation point, thereby alleviating the problem of sub-optimal solutions to some extent.

Figure 6: Performance trend of Elastic Loss under different settings.

Analysis of Stability: A good model should perform well in any situation, especially when attacked by degraded input. The stability test in Table 6 covers several groups of degraded cases: size change, blur attack, noise attack, and brightness attack. The results show that BML performs well in every case, demonstrating the stability of the binocular learning strategy.
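To make the attack categories concrete, the sketch below implements simple NumPy stand-ins for the four degradations; the actual evaluation presumably uses standard image transforms, and the kernel sizes, noise ratios, and scale factors here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def resize_nearest(img, scale=0.5):
    """Nearest-neighbor resize as a stand-in for the size-change attack."""
    h, w = img.shape[:2]
    ys = (np.arange(int(h * scale)) / scale).astype(int)
    xs = (np.arange(int(w * scale)) / scale).astype(int)
    return img[ys][:, xs]

def box_blur(img, k=3):
    """Naive box filter as a stand-in for GaussianBlur."""
    pad = k // 2
    padded = np.pad(img, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    out = np.zeros(img.shape, dtype=float)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return (out / (k * k)).astype(img.dtype)

def pepper_noise(img, ratio=0.1):
    """Set a random fraction of pixels to black."""
    out = img.copy()
    mask = rng.random(img.shape[:2]) < ratio
    out[mask] = 0
    return out

def brightness(img, factor=1.3):
    """Scale pixel intensities as a stand-in for ColorJitter."""
    return np.clip(img.astype(float) * factor, 0, 255).astype(img.dtype)
```

Running the trained model on images passed through such perturbations, without any retraining, is what the degraded columns of Table 6 measure.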


Attack          Method                                  Clean    Degraded

Size change     DeepEMD [zhang2020deepemd]              82.41    78.12
                BML                                     83.59    81.57

GaussianBlur    Rethink-Distill [tian2020rethinking]    82.14
                BML                                     83.59    61.96

PepperNoise     Rethink-Distill [tian2020rethinking]    82.14    63.97
                BML                                     83.59    71.02

ColorJitter     Rethink-Distill [tian2020rethinking]    82.14    81.05
                BML                                     83.59    82.24

Table 6: Stability evaluation on miniImageNet.

Is Similarity Ranking important for a good model? We randomly visualize a task (Figure 7) and find an interesting sidelight: while improving discriminability, BML also obtains a more accurate inter-class relationship and a more accurate heatmap. This inspires us to ask: under the current closed-set FSC setup, should we pay attention to similarity ranking besides top-1 accuracy? A semantically meaningful ranking is more practical and can further distinguish the advantages of existing methods.

Figure 7: Similarity ranking on a 4-way 1-shot task (golden retriever, dalmatian, lion and bus). (a) Our BML, (b) single global view, (c) single local view. BML gives accurate ranking results, while the latter two fail to capture the inter-class relationship.
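Operationally, the similarity ranking for a query is just the ordering of support prototypes by distance in the embedding space; a minimal sketch:

```python
import numpy as np

def similarity_ranking(query_feat, protos):
    """Return class indices ordered from most to least similar
    (smallest to largest Euclidean distance)."""
    d = np.linalg.norm(protos - query_feat, axis=-1)
    return np.argsort(d)
```

Top-1 accuracy only checks the first entry of this ordering; a ranking-aware metric would also reward placing semantically close classes (e.g., the two dog breeds) ahead of unrelated ones.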

5 Conclusion

Inspired by the mutual learning paradigm and binocular parallax, we propose a unified Binocular Mutual Learning (BML) framework, which achieves compatibility of the global view and the local view through both intra-view and cross-view modeling. The effectiveness of BML has been fully demonstrated in both within-domain and cross-domain evaluations. The aggregated features are more robust than those of other competitors when dealing with degradation attacks. Besides, BML obtains accurate similarity rankings.

Acknowledgement. This work is supported by the National Key R&D Plan of the Ministry of Science and Technology (Project No. 2020AAA0104400).

Appendix A More Experimental Results

a.1 More Shots

As the number of visible samples (support shots) increases (Figure 8), performance gradually improves, and BML stays steadily above the two single-view baselines. Besides, BML-global and BML-local under the binocular mode are superior to baseline-global and baseline-local under the single-view mode.

(a) 10-shot
(b) 20-shot
Figure 8: Comparison between BML and the two baselines with more shots.

a.2 More Benchmarks

To further verify the performance of BML, we do experiments on another public few-shot classification benchmark: FC100. FC100 is derived from CIFAR-100, which has 100 classes in total: 60 are used for training, 20 for validation, and the remaining 20 for testing. Since the split is carried out at the superclass level, information overlap between splits is minimized, making the benchmark more challenging.

As shown in Table 7, on FC100, BML is still superior to the two single-view baselines and stays ahead of the other six competitors. Specifically, it confirms three key points emphasized in Section 4:


  • Binocular learning is better than single-view mode. BML is 2% higher than the single global view and 5%-9% higher than the single local view.

  • On the coarse-grained dataset, the global view performs better than the local view.

  • The two complementary views promote each other (i.e., BML-global vs. baseline-global, BML-local vs. baseline-local), and the global view's impact on the local view is more obvious.


Method            FC100
                  5-way 1-shot   5-way 5-shot

MAML              38.10 ± 1.70   50.40 ± 1.00
MetaOptNet        41.10 ± 0.60   55.50 ± 0.60
ProtoNet          35.30 ± 0.60   48.60 ± 0.60
TADAM             40.10 ± 0.40   56.10 ± 0.40
Rethink           42.60 ± 0.70   59.10 ± 0.60
DC                42.04 ± 0.17   57.05 ± 0.16

Baseline-local    38.88 ± 0.38   54.25 ± 0.40
Baseline-global   42.61 ± 0.39   61.03 ± 0.40
BML-local         43.25 ± 0.41   58.70 ± 0.39
BML-global        43.88 ± 0.40   62.06 ± 0.39
BML               45.00 ± 0.41   63.03 ± 0.41

Table 7: Comparison on FC100.

a.3 More Analysis of Elastic Loss

We carefully monitor the elastic loss and further explore its mechanism. Figure 9 shows the trend of the training loss and of the distance between prototypes with and without the elastic loss (on miniImageNet). Comparing the left subfigure of Figure 9(a) with that of Figure 9(b), we find that without the elastic loss the loss value quickly drops to a low point and subsequently declines very slowly. On the contrary, after applying the elastic loss, the initial loss value increases significantly and the downward trend is more obvious. This shows that the elastic loss does increase the difficulty of optimization. Furthermore, as shown in the right subfigures of Figure 9(a) and Figure 9(b), the distance between prototypes (first-order moment) shows a similar change. With the help of the elastic loss, the distance between prototypes gradually expands and the features become more dispersed in the embedding space: the network learns to amplify the difference between prototypes to improve matching accuracy.
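The "distance between prototypes (first-order moment)" curve can be reproduced with a few lines; a sketch, assuming prototypes are per-class feature means:

```python
import numpy as np

def mean_prototype_distance(features, labels):
    """Average pairwise Euclidean distance between class prototypes
    (per-class feature means), i.e. the first-order moment tracked above."""
    classes = np.unique(labels)
    protos = np.stack([features[labels == c].mean(axis=0) for c in classes])
    d = np.linalg.norm(protos[:, None] - protos[None], axis=-1)
    off_diag = ~np.eye(len(classes), dtype=bool)
    return d[off_diag].mean()
```

Logging this scalar once per epoch on the base set is enough to produce the right-hand subfigures of Figure 9.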

(a) w/o elastic loss
(b) w/ elastic loss
Figure 9: Loss value and mean distance between prototypes on the base set.

a.4 More Visualization of Tasks

We randomly visualize two tasks in Figure 10; from left to right, the columns show BML, baseline-global and baseline-local. The prototypes (highlighted with stars) computed by BML are clearly more dispersed in the embedding space, which shows that BML helps to obtain more discriminative features.

Figure 10: t-SNE visualization results on two tasks.

a.5 More Analysis of Mutual Interaction

To further verify the influence of mutual interaction on performance, we design a series of ablation experiments, covering the number of shared blocks and the influence of mutual interaction. The results are shown below (S: shared, I: independent).


Methods        Accuracy       Params

(a) Ensemble   81.08 ± 0.31   24,930,688
(b) BML        83.10 ± 0.30   24,930,688
(c) BML        83.24 ± 0.30   24,813,504
(d) BML        83.30 ± 0.29   24,249,024
(e) BML        83.63 ± 0.29   21,891,264

Table 8: Analysis of the number of shared blocks.

According to the results in Table 8, comparing (a) and (b), a simple ensemble of baseline-global and baseline-local without interactive learning brings almost no benefit, since the difference between the two models is relatively large (see Table 2). (b) nevertheless performs well: mutual interactive learning ensures that the features of the two branches stay similar while maintaining appropriate differences. Comparing (b)-(e), the performance of BML changes relatively gently, which shows that the main factor affecting performance is whether binocular mutual learning is performed. To reduce the number of parameters, BML only separates the last block.
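The parameter counts in Table 8 follow directly from which backbone blocks are duplicated: shared blocks are counted once, branch-specific blocks twice. A sketch with hypothetical per-block counts (the real per-block sizes are not listed here):

```python
def binocular_params(block_params, n_shared):
    """Total parameter count when the first `n_shared` backbone blocks
    are shared between the two branches and the rest are duplicated."""
    shared = sum(block_params[:n_shared])
    independent = 2 * sum(block_params[n_shared:])
    return shared + independent
```

Since later ResNet blocks are the widest, sharing everything but the last block (row (e)) already removes most of the duplication while keeping the branch-specific capacity where the views diverge.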

Appendix B Efficient Implementation of BML

To fully unleash the power of the binocular framework, we adopt a uniform sampling strategy during training. Specifically, each batch contains randomly sampled classes.
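A minimal sketch of such a uniform class sampler; the number of classes per batch and images per class are left as parameters, since the concrete values are not given here:

```python
import numpy as np

def uniform_batch(labels, n_classes, n_per_class, rng=None):
    """Sample indices for a batch of `n_classes` randomly chosen classes
    with `n_per_class` images each (uniform sampling sketch; the batch
    shape parameters are assumptions)."""
    rng = rng or np.random.default_rng()
    classes = rng.choice(np.unique(labels), size=n_classes, replace=False)
    idx = []
    for c in classes:
        pool = np.flatnonzero(labels == c)
        idx.extend(rng.choice(pool, size=n_per_class, replace=False))
    return np.array(idx)
```

Because every sampled class contributes the same number of images, each batch can serve both the global branch (plain classification) and the local branch (episodes carved out of the same batch) without re-sampling.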