S2SD: Simultaneous Similarity-based Self-Distillation for Deep Metric Learning

09/17/2020, by Karsten Roth et al.

Deep Metric Learning (DML) provides a crucial tool for visual similarity and zero-shot retrieval applications by learning generalizing embedding spaces, although recent work in DML has shown strong performance saturation across training objectives. However, generalization capacity is known to scale with the embedding space dimensionality. Unfortunately, high-dimensional embeddings also create higher retrieval cost for downstream applications. To remedy this, we propose S2SD - Simultaneous Similarity-based Self-distillation. S2SD extends DML with knowledge distillation from auxiliary, high-dimensional embedding and feature spaces to leverage complementary context during training, while retaining test-time cost and introducing negligible changes to the training time. Experiments and ablations across different objectives and standard benchmarks show S2SD offering notable improvements of up to 7% in Recall@1, while also setting a new state-of-the-art. Code available at https://github.com/MLforHealth/S2SD.






Code Repositories


(ICML 2021) Implementation for S2SD - Simultaneous Similarity-based Self-Distillation for Deep Metric Learning. Paper Link: https://arxiv.org/abs/2009.08348


1 Introduction

Deep Metric Learning (DML) aims to learn embedding space models in which a predefined distance metric reflects not only the semantic similarities between training samples, but also transfers to unseen classes. The generalization capabilities of these models are important for applications in image retrieval (Wu et al., 2017), face recognition (Schroff et al., 2015), clustering (Bouchacourt et al., 2018) and representation learning (He et al., 2019). Still, transfer learning to unknown test distributions remains an open problem, with Roth et al. (2020) and Musgrave et al. (2020) revealing strong performance saturation across DML training objectives. However, Roth et al. (2020) also show that embedding space dimensionality can be a driver for generalization across objectives due to higher representation capacity. Indeed, this insight can be linked to recent work targeting other objective-independent improvements to DML via artificial samples (Zheng et al., 2019), higher feature distribution moments (Jacob et al., 2019) or orthogonal features (Milbich et al., 2020a), which have shown promising relative improvements over selected DML objectives. Unfortunately, these methods come at a cost, be it longer training times or limited applicability. Similar drawbacks can be found when naively increasing the operating (base) embedding dimensionality, which incurs increased cost for data retrieval at test time and is especially problematic on larger datasets. This limits the realistically usable dimensionalities and leads to benchmarks being evaluated against fixed, predefined embedding dimensions.

In this work, we propose Simultaneous Similarity-based Self-Distillation (S2SD) to show that complex higher-dimensional information can actually be leveraged effectively in DML without changing the base embedding dimensionality or test-time cost, which we motivate from two key elements. Firstly, in DML, an additional embedding space can be spanned by a multilayer perceptron (MLP) operating over the feature representation shared with the base embedding space (see e.g. Milbich et al. (2020a)). With a larger output dimensionality, we can thus cheaply learn a secondary high-dimensional embedding space simultaneously, also denoted as the target embedding space. Relative to the large feature backbone, and with the batchsize capping the number of additional high-dimensional operations, only little additional training cost is introduced. While we cannot utilize the high-dimensional target space at test time for the aforementioned reasons, we may utilize it to boost the performance of the base embedding space. Unfortunately, a simple connection of base and target spaces through the shared feature backbone is insufficient for the base space to benefit from the auxiliary, high-dimensional information. Thus, secondly, to efficiently leverage the high-dimensional context, we use insights from knowledge distillation (Hinton et al., 2015), where a small “student” model is trained to approximate a larger “teacher” model. However, while knowledge distillation can be found in DML (Chen et al., 2017b), few-shot learning (Tian et al., 2020) and self-supervised extensions thereof (Rajasegaran et al., 2020), the reliance on additional, commonly larger teacher networks or multiple training runs (Furlanello et al., 2018) introduces much higher training cost. Fortunately, we find that the target space learned simultaneously at higher dimensionality can sufficiently serve as a “teacher” during training: through knowledge distillation of its sample similarities, the performance of the base space can be improved notably. Such distillation intuitively encourages the lower-dimensional base space to embed semantic similarities similarly to the more expressive target space and thus incorporate dimensionality-related generalization benefits.

Furthermore, S2SD makes use of the low cost of spanning additional embedding branches to introduce multiple target spaces. Operating each of them at higher, but varying, dimensionality, joint distillation can then be used to enforce reusability in the distilled content, akin to feature reusability in meta-learning (Raghu et al., 2019), for additional generalization boosts. Finally, in DML, the base embedding space is spanned over a penultimate feature space of much higher dimensionality, which introduces a dimensionality-based bottleneck (Milbich et al., 2020b). By applying the distillation objective between the feature space and the base embedding space in S2SD, we further encourage better feature usage in the base space. This facilitates the approximation of high-dimensional context through the base space for additional improvements in generalization.

The benefits to generalization are highlighted in performance boosts across three standard benchmarks, CUB200-2011 (Wah et al., 2011), CARS196 (Krause et al., 2013) and Stanford Online Products (Oh Song et al., 2016), where S2SD improves test-set Recall@1 of already strong DML objectives by up to 7%, while also setting a new state-of-the-art. Improvements are even more significant for very low-dimensional base embedding spaces, making S2SD attractive for large-scale retrieval problems which benefit from reduced embedding dimensionalities. Importantly, as S2SD is applied during the same DML training process on the same network backbone, no large teacher networks or additional training runs are required. Simple experiments even show that S2SD can outperform comparable 2-stage distillation at much lower cost.

In summary, our contributions can be described as:
1) We propose Simultaneous Similarity-based Self-Distillation (S2SD) for DML, using knowledge distillation of high-dimensional context without large additional teacher networks or training runs.
2) We motivate and evaluate this approach through detailed ablations and experiments, showing that the method is agnostic to choices in objectives, backbones, and datasets.
3) Across benchmarks, we achieve significant improvements over strong baseline objectives and state-of-the-art performance, with especially large boosts for very low-dimensional embedding spaces.

2 Related Work

Deep Metric Learning (DML) has proven useful for zero-shot image/video retrieval & clustering (Schroff et al., 2015; Wu et al., 2017; Brattoli et al., 2020), face verification (Liu et al., 2017; Deng et al., 2018) and contrastive (self-supervised) representation learning (e.g. He et al. (2019); Chen et al. (2020); Misra and van der Maaten (2019)). Approaches can be divided into 1) improved ranking losses, 2) tuple sampling methods and 3) extensions to the standard DML training approach.
1) Ranking losses place constraints on relations in image tuples, ranging from pairs (e.g. Hadsell et al. (2006)) to triplets (Schroff et al., 2015) and more complex orderings (Chen et al., 2017a; Oh Song et al., 2016; Sohn, 2016; Wang et al., 2019). 2) The number of possible tuples scales exponentially with dataset size, leading to many tuple sampling approaches that ensure meaningful tuples are presented during training. These tuple sampling methods can follow heuristics (Schroff et al., 2015; Wu et al., 2017), be of hierarchical nature (Ge, 2018) or learned (Roth et al., 2020). Similarly, learnable proxies replacing tuple members (Movshovitz-Attias et al., 2017; Kim et al., 2020; Qian et al., 2019) can also remedy the sampling issue, which can be extended to tackle DML from a classification viewpoint (Zhai and Wu, 2018; Deng et al., 2018). 3) Finally, extensions to the basic training scheme can involve synthetic data (Lin et al., 2018; Zheng et al., 2019; Duan et al., 2018), complementary features (Roth et al., 2019; Milbich et al., 2020a), a division into subspaces (Sanakoyeu et al., 2019; Xuan et al., 2018; Kim et al., 2018; Opitz et al., 2018) or higher-order moments (Jacob et al., 2019).
While S2SD is similarly an extension to DML, we specifically focus on capturing and distilling complex high-dimensional sample relations into a lower embedding space to improve generalization.

Knowledge Distillation involves knowledge transfer from teacher to (usually smaller) student models, e.g. by matching network softmax outputs/logits (Buciluǎ et al., 2006; Hinton et al., 2015), (attention-weighted) feature maps (Romero et al., 2014; Zagoruyko and Komodakis, 2016), or latent representations (Ahn et al., 2019; Park et al., 2019; Tian et al., 2019; Laskar and Kannala, 2020). Importantly, Tian et al. (2019) show that under fair comparison, basic matching via Kullback-Leibler (KL) divergences as used in Hinton et al. (2015) performs very well, which we also find to be the case for S2SD. This is further supported in recent few-shot learning literature (Tian et al., 2020), wherein KL-distillation alongside self-distillation (Furlanello et al., 2018) in additional meta-training stages improves the feature representation strength important for generalization (Raghu et al., 2019).
S2SD is a form of self-distillation, i.e. distilling without a separate, larger teacher network. However, we leverage dimensionality-related context for distillation, which allows S2SD to be used in the same training run. S2SD also stands in contrast to existing knowledge distillation applications to DML, which are done in a generic manner with separate, larger teacher networks and additional training stages (Chen et al., 2017b; Yu et al., 2019; Han et al., 2019; Laskar and Kannala, 2020).

3 Method

We now introduce key elements for Simultaneous Similarity-based Self-Distillation (S2SD) to improve generalization of embedding spaces by utilizing higher dimensional context. We start with preliminary notation and fundamentals to Deep Metric Learning (§3.1). We then define the three key elements to S2SD: Firstly, the Dual Self-Distillation (DSD) objective, which uses KL-Distillation on a concurrently learned high-dimensional embedding space (§3.2) to introduce the high-dimensional context into a low-dimensional embedding space during training. We then extend this to Multiscale Self-Distillation (MSD) with distillation from several different high-dimensional embedding spaces to encourage reusability in the distilled context (§3.3). Finally, we shift to self-distillation from normalized feature representations to counter dimensionality bottlenecks (MSDF) (§3.4).

3.1 Preliminaries

DML builds on generic Metric Learning, which aims to find a (parametrized) distance metric d(x_i, x_j) on the feature space over images x that best satisfies ranking constraints, usually defined over class labels y. This holds also for DML. However, while Metric Learning relies on a fixed feature extraction method, DML introduces deep neural networks ψ to concurrently learn the feature representation. Most such DML approaches aim to learn Mahalanobis distance metrics, which cover the parametrized family of inner product metrics (Suárez et al., 2018; Chen et al., 2019). These metrics, with some restrictions (Suárez et al., 2018), can be reformulated as

d(x_i, x_j) = ||φ(x_i) − φ(x_j)||₂,    φ = f ∘ ψ,

with a learned linear projection f from D-dimensional features ψ(x) to d-dimensional embeddings, defining the embedding function φ. Importantly, this redefines the motivation behind DML as learning d-dimensional image embeddings such that their Euclidean distance is connected to semantic similarities. This embedding-based formulation offers the significant advantage of being compatible with fast approximate similarity search methods (e.g. Johnson et al. (2017)), allowing for large-scale applications at test time. In this work, we assume φ(x) to be normalized to the unit hypersphere, which is commonly done (Wu et al., 2017; Sanakoyeu et al., 2019; Liu et al., 2017; Wang and Isola, 2020) for its beneficial regularizing effects (Wu et al., 2017; Wang and Isola, 2020). For the remainder, φ hence refers to the normalized embedding.
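The embedding-based formulation above can be sketched concretely (all function and variable names here are hypothetical, and the backbone ψ is replaced by precomputed features): a linear projection to d dimensions, normalization onto the unit hypersphere, and plain Euclidean distances between embeddings.

```python
import numpy as np

def embed(features, W):
    # Linear projection f: R^D -> R^d, then L2-normalization onto the
    # unit hypersphere (hypothetical head; the backbone psi is omitted
    # and 'features' stands in for psi(x)).
    z = features @ W
    return z / np.linalg.norm(z, axis=1, keepdims=True)

def pairwise_distances(emb):
    # Euclidean distances between all pairs of (normalized) embeddings,
    # computed via the expansion ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b.
    sq = np.sum(emb ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * emb @ emb.T
    return np.sqrt(np.maximum(d2, 0.0))
```

On normalized embeddings, this Euclidean distance is monotonically related to cosine similarity, which is why the two are used interchangeably later in the method section.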
Common approaches to learn such a representation space involve training surrogates on ranking constraints defined by class labels. Such approaches start from pair- or triplet-based ranking objectives (Hadsell et al., 2006; Schroff et al., 2015), where the latter is defined as

L_triplet = Σ_{(a,p,n) ∈ T} [ ||φ_a − φ_p||₂² − ||φ_a − φ_n||₂² + γ ]₊,

with margin γ and the set T of available triplets (a, p, n) in a mini-batch B, with y_a = y_p ≠ y_n. This can be extended with more complex ranking constraints or tuple sampling methods. We refer to Supp. B and Roth et al. (2020) for further insights and detailed studies.
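A minimal sketch of such a triplet objective, enumerating all valid (anchor, positive, negative) triplets in a batch (a naive O(b³) loop for clarity; the exact distance variant and margin value are assumptions, not the paper's tuned settings):

```python
import numpy as np

def triplet_loss(emb, labels, margin=0.2):
    # Margin-based triplet loss over all valid (a, p, n) triplets:
    # anchors and positives share a label, negatives do not.
    d = np.linalg.norm(emb[:, None] - emb[None, :], axis=-1)
    losses = []
    b = len(labels)
    for a in range(b):
        for p in range(b):
            if p == a or labels[p] != labels[a]:
                continue
            for n in range(b):
                if labels[n] == labels[a]:
                    continue
                # Hinge: positive should be closer than negative by 'margin'.
                losses.append(max(0.0, d[a, p] - d[a, n] + margin))
    return float(np.mean(losses)) if losses else 0.0
```

In practice, tuple sampling strategies (as cited above) replace this exhaustive enumeration.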

3.2 Embedding Space Self-Distillation

Figure 1: S2SD. We use a standard encoder ψ, a base embedding branch f, and multiple auxiliary embedding branches (used only during training), depending on the S2SD variant used. During training, for each batch of embeddings produced by the respective embedding branch, we compute DML losses while applying embedding distillation on the respective batch-similarity matrices (DSD/MSD). We further distill from the feature representation space for additional information gain (MSDF).

For the aforementioned standard DML setting, the generalization performance of a learned embedding space can be linked to the utilized embedding dimensionality. However, high dimensionality results in notably higher retrieval cost in downstream applications, which limits the realistically usable dimensions. In S2SD, we show that high-dimensional context can be used as a teacher during the training run of the low-dimensional base (or reference) embedding space. As the base embedding model is also the one that is evaluated, test-time retrieval costs are left unchanged. To achieve this, we simultaneously train an additional high-dimensional auxiliary/target embedding space spanned by a secondary embedding branch g, parametrized by an MLP or a linear projection, similar to the base embedding space spanned by f (see §3.1). Both f and g operate on the same large, shared feature backbone ψ. For simplicity, we train f and g using the same DML objective L_DML.
Unfortunately, the higher expressivity and improved generalization of high-dimensional embeddings hardly benefit the base embedding space, even with a shared feature backbone. To explicitly leverage high-dimensional context for our base embedding space, we utilize knowledge distillation from the target to the base space. However, while common knowledge distillation approaches match single embeddings or features between student and teacher, the different dimensionalities of base and target space inhibit naive matching. Instead, S2SD matches sample relations (see e.g. Tian et al. (2019)) defined over batch-similarity matrices Σ and Σ_t in base and target space, with batchsize b. We thus encourage the base embedding space to relate different samples in a similar manner to the target space. To compute Σ, we use cosine similarity by default, given as Σ_ij = φ(x_i)ᵀφ(x_j), since φ is normalized to the unit hypersphere. Defining σ as the softmax operation and D_KL as the Kullback-Leibler divergence, we thus define the simultaneous self-distillation objective as

L_dist(Σ_t, Σ) = (1/b) Σ_{i=1}^{b} D_KL( σ(sg[Σ_t,i] / T) || σ(Σ_i / T) ),

with temperature T, as visualized in Figure 1. sg[·] denotes a stop-gradient, i.e. no gradient flows to the target branches, as we only want the base space to learn from the target space. By default, we match rows (or columns) of Σ_t and Σ, effectively distilling the relation of an anchor embedding to all other batch samples. Embedding all batch samples in base dimension, B, and higher dimension, B_t, the (simultaneous) Dual Self-Distillation (DSD) training objective then becomes

L_DSD = L_DML(B) + L_DML(B_t) + ω L_dist(Σ_t, Σ),

with distillation weight ω.
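The similarity-based distillation term can be sketched as follows (a minimal numpy version with assumed temperature handling; in a real implementation, gradients would be stopped on the target branch and the base/target embeddings would come from the shared backbone):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def s2sd_distill(emb_base, emb_target, temp=1.0):
    # Row-wise KL divergence between softmaxed batch cosine-similarity
    # matrices: each row relates one anchor to all other batch samples.
    # Both embedding batches are assumed L2-normalized, so the inner
    # products below are cosine similarities.
    sim_s = emb_base @ emb_base.T      # (b, b) base ("student") similarities
    sim_t = emb_target @ emb_target.T  # (b, b) target ("teacher") similarities
    p_t = softmax(sim_t / temp)
    p_s = softmax(sim_s / temp)
    kl = np.sum(p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12)), axis=1)
    return float(kl.mean())
```

Note that the base and target embeddings may have different dimensionalities: only their b×b similarity matrices are compared, which is what makes the cross-dimensional distillation possible.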


3.3 Reusable Sample Relations by Multiscale Self-distillation

While DSD encourages the reference embedding space to recover complex sample relations by distilling from a higher-dimensional target space, it is not known a priori which distillable sample relations actually benefit generalization of the reference space.
To encourage the usage of sample relations that more likely aid generalization, we follow insights made in Raghu et al. (2019) on the connection between reusability of features across multiple tasks and better generalization. We motivate reusability in S2SD by extending DSD to Multiscale Self-Distillation (MSD), distilling instead from multiple different target spaces. Importantly, each of these high-dimensional target spaces operates at a different dimensionality, i.e. d < d_1 < … < d_n. As this results in each target embedding space encoding sample relations differently, applying distillation across all target spaces pushes the base branch towards learning from sample relations that are reusable across all higher-dimensional embedding spaces and thereby more likely to generalize (see also Fig. 1).
Specifically, given the set of target similarity matrices {Σ_k} and target batch embeddings {B_k}, we define the MSD training objective as

L_MSD = L_DML(B) + (1/n) Σ_{k=1}^{n} [ L_DML(B_k) + ω L_dist(Σ_k, Σ) ].
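The many-to-one structure of the MSD objective described in this subsection can be sketched in a few lines of dependency-free Python (the callables, their signatures, and the averaging over target spaces are assumptions for illustration, not the paper's exact implementation):

```python
def msd_objective(dml_loss, distill, base_emb, target_embs, labels, omega=1.0):
    # Multiscale Self-Distillation, sketched: the DML loss on the base
    # space plus, averaged over the n target spaces, each target's own
    # DML loss and a distillation term pulling base batch-similarities
    # toward that target's batch-similarities.
    # dml_loss(emb, labels) and distill(base_emb, target_emb) are
    # user-supplied callables returning scalars.
    total = dml_loss(base_emb, labels)
    n = len(target_embs)
    for t_emb in target_embs:
        total += (dml_loss(t_emb, labels) + omega * distill(base_emb, t_emb)) / n
    return total
```

Each target space contributes both its own DML signal (so it keeps learning as a teacher) and a distillation signal flowing only into the base branch.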
3.4 Tackling the Dimensionality Bottleneck by Feature Space Self-Distillation

As noted in §3.1, the base embedding space utilizes a linear projection from the (penultimate) feature space, whose dimensionality D is commonly much larger than the embedding dimensionality d. While compressed semantic spaces encourage stronger representations to be learned (Alemi et al., 2016; Dai and Wipf, 2019), Milbich et al. (2020b) show that the actual test performance of the lower-dimensional embedding space is inferior to that of the non-adapted, but higher-dimensional feature space. This supports a dimensionality-based loss of information beneficial to generalization, which can hinder the base embedding space from optimally approximating the high-dimensional context introduced in §3.2 and §3.3.
To rectify this, we apply self-distillation (Eq. 3) on the normalized feature representations as well. With the batch of normalized feature representations and its similarity matrix Σ_ψ, we get Multiscale Self-Distillation with Feature distillation (MSDF) (see also Fig. 1):

L_MSDF = L_MSD + ω L_dist(Σ_ψ, Σ).
In the same manner, one can also address other architectural information bottlenecks such as through the generation of feature representations from a single global pooling operation. While not noted in the original publication, Kim et al. (2020) address this in the official code release by using both global max- and average pooling to create their base embedding space. While this naive usage changes the architecture at test time, in S2SD we can fairly leverage potential benefits by only spanning the auxiliary spaces (and distilling) from such feature representations (denoted as DSDA/MSDA/MSDFA).

4 Experimental Setup

We study S2SD in four experiments to establish 1) method ablation performance & relative improvements, 2) state-of-the-art, 3) comparisons to standard 2-stage distillation, benefits to low-dimensional embedding spaces & generalization properties and 4) motivation for architectural choices.

Method Notation. We abbreviate ablations of S2SD (see §3) in our experiments as: DSD & MSD for Dual (3.2) & Multiscale Self-Distillation (3.3), MSDF the addition of Feature distillation (3.4) and DSDA/MSD(F)A the inclusion of multiple pooling operations in the auxiliary branches (also §3.4).

4.1 Experiments

Fair Evaluation of Ablations. §5.1 specifically applies S2SD and its ablations to three DML baselines. To show realistic benefit, S2SD is applied to best-performing objectives evaluated in Roth et al. (2020), namely (i) Margin loss with Distance-based Sampling (Wu et al., 2017), (ii) their proposed Regularized Margin loss and (iii) Multisimilarity loss (Wang et al., 2019), following their experimental training pipeline. This setup utilizes no learning rate scheduling and fixes common implementational factors of variation in DML pipelines such as batchsize, base embedding dimension, weight decay or feature backbone architectures to ensure comparability in DML (more details in Supp. A.2). As such, our results are directly comparable to their large set of examined methods and guaranteed that relative improvements solely stem from the application of S2SD.

Evaluation Across Architectures and Embedding Dimensions. §5.2 further highlights the benefits of S2SD by comparing S2SD’s boosting properties across literature standards, with different backbone architectures and base embedding dimensions: (1) ResNet50 with d = 128 (Wu et al., 2017; Roth et al., 2019), (2) ResNet50 with d = 512 (Zhai and Wu, 2018), as well as (3) variants of Inception-V1 with Batch-Normalization at d = 512 (Wang et al., 2019; Qian et al., 2019; Milbich et al., 2020a). Only here do we conservatively apply learning rate scheduling, since all references noted in Table 2 employ scheduling as well. We categorize published work based on backbone architecture and embedding dimension for fairer comparison. Note that this is a less robust comparison than that in §5.1, due to potential implementation differences between our pipeline and reported literature results.

Comparison to 2-Stage Distillation and Generalization Study. §5.3 compares S2SD to 2-stage distillation, investigates benefits to very low dimensional reference spaces and examines the connection between improvements and increased embedding space feature richness, measured by density and spectral decay (see Supp. D), which are linked to improved generalization in Roth et al. (2020).

Investigation of Method Choices. §5.4 finally ablates and motivates specific architectural choices in S2SD used throughout §4. Pseudo code and detailed results are available in Supp. F, G, and H.

4.2 Implementation

Datasets & Evaluation. In all experiments, we evaluate on standard DML benchmarks: CUB200-2011 (Wah et al., 2011), CARS196 (Krause et al., 2013) and Stanford Online Products (SOP) (Oh Song et al., 2016). Performance is measured in recall at 1 (R@1) and at 2 (R@2) (Jegou et al., 2011) as well as Normalized Mutual Information (NMI) (Manning et al., 2010). More details in Supp. A & C.

Experimental Details. Our implementation follows Roth et al. (2020), with more details in Supp. A. For §5.1–5.4, we only adjust the respective pipeline elements in question. For S2SD, unless noted otherwise (such as in §5.4), we use one fixed distillation weight for all objectives on CUB200 and CARS196, and a different one on SOP. DSD uses a single high target dimension, while MSD uses a set of increasing target dimensions. We found it beneficial to activate the feature distillation only after an initial number of training iterations on CUB200, CARS196 and SOP, respectively, to ensure that meaningful features are learned before feature distillation is applied. The additional embedding spaces are generated by two-layer MLPs, with row-wise KL-distillation of similarities (Eq. 3) applied as in Eq. 5. By default, we use the Multisimilarity loss as the stand-in DML objective.
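A two-layer auxiliary branch of the kind described here can be sketched as follows (weight initialisation, hidden width, and activation are assumptions; in the paper these branches sit on top of the shared backbone and are trained jointly):

```python
import numpy as np

def make_aux_branch(in_dim, out_dim, rng):
    # Two-layer MLP head spanning an auxiliary high-dimensional target
    # space: linear -> ReLU -> linear -> L2-normalization.
    W1 = rng.standard_normal((in_dim, in_dim)) / np.sqrt(in_dim)
    W2 = rng.standard_normal((in_dim, out_dim)) / np.sqrt(in_dim)

    def branch(x):
        h = np.maximum(x @ W1, 0.0)  # ReLU hidden layer
        z = h @ W2
        return z / np.linalg.norm(z, axis=1, keepdims=True)

    return branch
```

Since these branches exist only during training, they can be discarded entirely at test time, leaving the base embedding head untouched.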

Benchmark: CUB200-2011 | CARS196 | SOP
Approaches (R@1 and NMI reported per benchmark):
Margin loss (Wu et al., 2017)
R-Margin loss (Roth et al., 2020)
Multisimilarity loss (Wang et al., 2019)
Table 1: S2SD comparison against strong baseline objectives. All results computed over multiple seeds. Bold denotes best results per loss & benchmark; blue bold marks best results per benchmark.

5 Results

5.1 S2SD Improves Performance under Fair Evaluation

In Tab. 1 (full table in Supp. Tab. 4), we show that under the fair experimental protocol used in Roth et al. (2020), utilizing S2SD and its ablations gives an objective- and benchmark-independent, significant boost in performance of up to 7%, opposing the existing DML objective performance plateau. This holds even for regularized objectives such as R-Margin loss, highlighting the effectiveness of S2SD for DML. Across objectives, S2SD-based changes in wall-time remain negligible.

5.2 S2SD Achieves SOTA Across Architecture and Embedding Dimension

Benchmarks: CUB200 (Wah et al., 2011) | CARS196 (Krause et al., 2013) | SOP (Oh Song et al., 2016)
Methods: R@1 R@2 NMI | R@1 R@2 NMI | R@1 R@10 NMI
Div&Conq (Sanakoyeu et al., 2019): 65.9 76.6 69.6 | 84.6 90.7 70.3 | 75.9 88.4 90.2
MIC (Roth et al., 2019): 66.1 76.8 69.7 | 82.6 89.1 68.4 | 77.2 89.4 90.0
PADS (Roth et al., 2020): 67.3 78.0 69.9 | 83.5 89.7 68.8 | 76.5 89.0 89.9
Multisimilarity+S2SD: 68.0±0.2 78.7±0.1 71.7±0.4 | 86.3±0.1 91.8±0.3 72.0±0.3 | 79.0±0.2 90.2±0.1 90.6±0.1
Margin+S2SD: 67.6±0.3 78.2±0.2 70.8±0.3 | 86.0±0.2 91.8±0.2 72.2±0.2 | 80.2±0.2 91.5±0.1 90.9±0.1
R-Margin+S2SD: 68.9±0.3 79.0±0.3 72.1±0.4 | 87.6±0.2 92.7±0.2 72.3±0.2 | 79.2±0.2 90.3±0.1 90.8±0.1
EPSHN (Xuan et al., 2020): 64.9 75.3 - | 82.7 89.3 - | 78.3 90.7 -
NormSoft (Zhai and Wu, 2018): 61.3 73.9 - | 84.2 90.4 - | 78.2 90.6 -
DiVA (Milbich et al., 2020a): 69.2 79.3 71.4 | 87.6 92.9 72.2 | 79.6 91.2 90.6
Multisimilarity+S2SD: 69.2±0.1 79.1±0.2 71.4±0.2 | 89.2±0.2 93.8±0.2 74.0±0.2 | 80.8±0.2 92.2±0.2 90.5±0.3
Margin+S2SD: 68.8±0.2 78.5±0.2 72.3±0.1 | 89.3±0.2 93.8±0.2 73.7±0.3 | 81.0±0.2 92.1±0.2 91.1±0.3
R-Margin+S2SD: 70.1±0.2 79.7±0.2 71.6±0.2 | 89.5±0.2 93.9±0.3 72.9±0.3 | 80.0±0.2 91.4±0.2 90.8±0.1
DiVA (Milbich et al., 2020a): 66.8 77.7 70.0 | 84.1 90.7 68.7 | 78.1 90.6 90.4
Multisimilarity+S2SD: 66.7±0.3 77.5±0.3 70.5±0.2 | 83.8±0.3 90.3±0.2 69.8±0.3 | 78.5±0.2 90.6±0.2 90.6±0.1
Margin+S2SD: 66.8±0.2 77.9±0.2 69.9±0.3 | 84.3±0.2 90.7±0.2 69.8±0.2 | 78.4±0.2 90.5±0.2 90.4±0.1
R-Margin+S2SD: 67.4±0.3 78.0±0.4 70.3±0.2 | 83.9±0.3 90.3±0.2 69.4±0.2 | 78.1±0.2 90.4±0.3 90.3±0.2
Softtriple (Qian et al., 2019): 65.4 76.4 69.3 | 84.5 90.7 70.1 | 78.3 90.3 92.0
Multisimilarity (Wang et al., 2019): 65.7 77.0 - | 84.1 90.4 - | 78.2 90.5 -
Multisimilarity+S2SD: 68.2±0.3 79.1±0.2 71.6±0.2 | 86.3±0.2 92.2±0.2 72.0±0.3 | 78.9±0.2 90.8±0.2 90.6±0.1
Margin+S2SD: 68.3±0.2 78.8±0.2 71.2±0.2 | 87.1±0.2 92.4±0.1 72.2±0.2 | 79.1±0.2 91.0±0.3 90.4±0.1
R-Margin+S2SD: 69.6±0.3 79.6±0.3 71.2±0.1 | 86.6±0.3 92.1±0.3 70.9±0.2 | 78.5±0.1 90.5±0.2 90.0±0.2
Table 2: State-of-the-art comparison. We show that S2SD, represented by its variants MSDF(A), boosts baseline objectives to state-of-the-art across the literature. (*) stands for Inception-V1 with frozen Batch-Norm. Bold: best results per literature setup. Blue bold: best results per overall benchmark.

Motivated by Tab. 1, we use MSDFA for CUB200/CARS196 and MSDF for SOP. Table 2 shows that S2SD can boost baseline objectives to reach and even surpass the state-of-the-art, in parts with a notable margin, even when reported with confidence intervals, which is commonly neglected in DML. S2SD outperforms much more complex methods relying on feature mining or RL-policies, such as MIC (Roth et al., 2019), DiVA (Milbich et al., 2020a) or PADS (Roth et al., 2020).

5.3 S2SD is a Strong Substitute for Normal Distillation and Learns Generalizing Embedding Spaces Across Dimensionalities

Comparison to standard distillation. With a student S (same objective and embedding dimension as the reference branch in DSD) and a teacher T at the highest optimal dimensionality, we find that separating DSD into standard 2-stage distillation degrades performance (see Fig. 3A, compare to Dist.). S2SD also allows for easy integration of teacher ensembles, realized by MSD(F,A), which even notably outperform the teacher while operating at the embedding dimensionality of the student.

Benefits to lower base dimensions. We show that our module is able to vastly boost networks limited to very low embedding dimensions (cf. Fig. 3B). For example, networks trained with S2SD can match the performance of embedding dimensions four or eight times their size. At the lowest dimensionalities, S2SD even notably outperforms the highest-dimensional baseline.

Figure 2: Generalization metrics. S2SD increases embed. space density and lowers spectral decay.

Embedding space metrics. We now look at relative changes in embedding space density and spectral decay as in Roth et al. (2020), although we investigate changes within the same objectives. Fig. 2 shows S2SD increasing embedding space density and lowering the spectral decay (hence providing a more feature-diverse embedding space) across criteria.

5.4 Motivating S2SD Architecture Choices

Is distillation in S2SD important? Fig. 3A (Joint) and Fig. 3F highlight how crucial self-distillation is, as using a secondary embedding space without distillation hardly improves performance. Fig. 3A (Concur.) shows that jointly training a detached reference embedding while otherwise training in high dimension also does not offer notable improvement. Finally, Figure 3F shows robustness to changes in the distillation weight, with dataset-dependent peaks for CUB200/CARS196 and SOP. We also found best performance for low temperatures and hence fix the temperature by default.

Best way to enforce reusability. To motivate our many-to-one self-distillation (Eq. 5, here also dubbed Multi), we evaluate against other distillation setups that could support reusability of distilled sample relations. (1) Nested distillation, where instead of distilling all target spaces only to the reference space, we distill from each target space to all lower-dimensional embedding spaces:

L_nested = L_DML(B) + (1/n) Σ_{k=1}^{n} [ L_DML(B_k) + ω Σ_{l=0}^{k−1} L_dist(Σ_k, Σ_l) ],

where, in the second term, Σ_0 denotes the base similarity matrix Σ. (2) Chained distillation, which distills from a target space only to the lower-dimensional embedding space closest in dimensionality:

L_chained = L_DML(B) + (1/n) Σ_{k=1}^{n} [ L_DML(B_k) + ω L_dist(Σ_k, Σ_{k−1}) ].

Figure 3E shows that while either distillation method provides strong benefits, many-to-one distillation performs notably better, supporting the reusability aspect and its choice as our default method.

Choice of distillation method & branch structures. Fig. 3C evaluates various distillation objectives, finding KL-divergence between vectors of similarities to perform better than KL-divergence applied over full similarity matrices or row-wise means thereof, as well as cosine/euclidean distance-based distillation (see e.g. (Yu et al., 2019)). Figure 3D shows insights into optimal auxiliary branch structures, with two-layer MLPs giving the largest benefit, although even a linear target mapping reliably boosts performance. This coincides with insights made by Chen et al. (2020). Further network depth only deteriorates performance.

Figure 3: S2SD study and ablations. (A) DSD outperforms comparable two-stage distillation on student S (Dist.) using teacher (T), with MSD(FA) even outperforming the teacher. We further see that distillation is essential for improvement: training multiple spaces in parallel (Joint) or a detached lower-dimensional base embedding (Concur.) gives little benefit. (B) We see benefits across base dimensionalities, especially in the low-dimensional regime. (C) We find KL-distillation between similarity vectors (R-KL) to work best. (D) An additional non-linearity in auxiliary branches gives a boost, but going deeper degrades generalization. (E) Distilling each auxiliary embedding space (Multi) to the reference space compares favourably against other distillation setups such as Nested and Chained distillation. (F) We find performance to be robust to changes in weight values.

6 Conclusion

In this paper, we propose a novel knowledge-distillation based DML training paradigm, Simultaneous Similarity-based Self-Distillation (S2SD), to utilize high-dimensional context for improved generalization. S2SD solves the standard DML objective simultaneously in higher-dimensional embedding spaces while applying knowledge distillation concurrently between these high-dimensional teacher spaces and a lower-dimensional reference space. S2SD introduces little additional computational overhead, with no extra cost at test time. Thorough ablations and experiments show S2SD significantly improving the generalization performance of existing DML objectives regardless of embedding dimensionality, while also setting a new state-of-the-art on standard benchmarks.


Acknowledgments

We would like to thank Samarth Sinha (University of Toronto, Vector), Matthew McDermott (MIT) and Mengye Ren (University of Toronto, Vector) for insightful discussions and feedback on the paper draft. This work was funded in part by a CIFAR AI Chair at the Vector Institute, Microsoft Research, and an NSERC Discovery Grant. Resources used in preparing this research were provided, in part, by the Province of Ontario, the Government of Canada through CIFAR, and companies sponsoring the Vector Institute.



  • S. Ahn, S. X. Hu, A. Damianou, N. D. Lawrence, and Z. Dai (2019) Variational information distillation for knowledge transfer. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). External Links: ISBN 9781728132938, Link, Document Cited by: §2.
  • A. A. Alemi, I. Fischer, J. V. Dillon, and K. Murphy (2016) Deep variational information bottleneck. External Links: 1612.00410 Cited by: §3.4.
  • D. Bouchacourt, R. Tomioka, and S. Nowozin (2018) Multi-level variational autoencoder: learning disentangled representations from grouped observations. In AAAI 2018, Cited by: §1.
  • B. Brattoli, J. Tighe, F. Zhdanov, P. Perona, and K. Chalupka (2020) Rethinking zero-shot video classification: end-to-end training for realistic applications. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • C. Buciluǎ, R. Caruana, and A. Niculescu-Mizil (2006) Model compression. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 535–541. Cited by: §2.
  • S. Chen, L. Luo, J. Yang, C. Gong, J. Li, and H. Huang (2019) Curvilinear distance metric learning. In Advances in Neural Information Processing Systems 32, pp. 4223–4232. External Links: Link Cited by: §3.1.
  • T. Chen, S. Kornblith, M. Norouzi, and G. Hinton (2020) A simple framework for contrastive learning of visual representations. External Links: 2002.05709 Cited by: §2, §5.4.
  • W. Chen, X. Chen, J. Zhang, and K. Huang (2017a) Beyond triplet loss: a deep quadruplet network for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.
  • Y. Chen, N. Wang, and Z. Zhang (2017b) DarkRank: accelerating deep metric learning via cross sample similarities transfer. External Links: 1707.01220 Cited by: §1, §2.
  • B. Dai and D. Wipf (2019) Diagnosing and enhancing vae models. External Links: 1903.05789 Cited by: §3.4.
  • J. Deng, J. Guo, N. Xue, and S. Zafeiriou (2018) ArcFace: additive angular margin loss for deep face recognition. External Links: 1801.07698 Cited by: §2.
  • Y. Duan, W. Zheng, X. Lin, J. Lu, and J. Zhou (2018) Deep adversarial metric learning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • T. Furlanello, Z. C. Lipton, M. Tschannen, L. Itti, and A. Anandkumar (2018) Born again neural networks. External Links: 1805.04770 Cited by: §1, §2.
  • W. Ge (2018) Deep metric learning with hierarchical triplet loss. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 269–285. Cited by: §2.
  • R. Hadsell, S. Chopra, and Y. LeCun (2006) Dimensionality reduction by learning an invariant mapping. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2, §3.1.
  • J. Han, T. Zhao, and C. Zhang (2019) Deep distillation metric learning. Proceedings of the ACM Multimedia Asia. Cited by: §2.
  • K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick (2019) Momentum contrast for unsupervised visual representation learning. External Links: 1911.05722 Cited by: §1, §2.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §A.2.
  • G. Hinton, O. Vinyals, and J. Dean (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531. Cited by: §1, §2.
  • S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. pp. 448–456. External Links: Link Cited by: §A.2.
  • P. Jacob, D. Picard, A. Histace, and E. Klein (2019) Metric learning with horde: high-order regularizer for deep embeddings. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §2.
  • H. Jegou, M. Douze, and C. Schmid (2011) Product quantization for nearest neighbor search. IEEE transactions on pattern analysis and machine intelligence 33 (1), pp. 117–128. Cited by: Appendix C, §4.2.
  • J. Johnson, M. Douze, and H. Jégou (2017) Billion-scale similarity search with gpus. arXiv preprint arXiv:1702.08734. Cited by: Appendix E, §3.1.
  • S. Kim, D. Kim, M. Cho, and S. Kwak (2020) Proxy anchor loss for deep metric learning. External Links: 2003.13911 Cited by: §A.2, §2, §3.4.
  • W. Kim, B. Goyal, K. Chawla, J. Lee, and K. Kwon (2018) Attention-based ensemble for deep metric learning. In Proceedings of the European Conference on Computer Vision (ECCV), Cited by: §2.
  • D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. Cited by: §A.2.
  • J. Krause, M. Stark, J. Deng, and L. Fei-Fei (2013) 3d object representations for fine-grained categorization. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 554–561. Cited by: §A.1, §1, §4.2, Table 2.
  • Z. Laskar and J. Kannala (2020) Data-efficient ranking distillation for image retrieval. External Links: 2007.05299 Cited by: §2.
  • X. Lin, Y. Duan, Q. Dong, J. Lu, and J. Zhou (2018) Deep variational metric learning. In The European Conference on Computer Vision (ECCV), Cited by: §2.
  • W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, and L. Song (2017) SphereFace: deep hypersphere embedding for face recognition. IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §2, §3.1.
  • S. P. Lloyd (1982) Least squares quantization in pcm. IEEE Trans. Information Theory 28, pp. 129–136. Cited by: Appendix C.
  • C. Manning, P. Raghavan, and H. Schütze (2010) Introduction to information retrieval. Natural Language Engineering 16 (1), pp. 100–103. Cited by: Appendix C, §4.2.
  • T. Milbich, K. Roth, H. Bharadhwaj, S. Sinha, Y. Bengio, B. Ommer, and J. P. Cohen (2020a) DiVA: diverse visual feature aggregation for deep metric learning. External Links: 2004.13458 Cited by: §1, §1, §2, §4.1, §5.2, Table 2.
  • T. Milbich, K. Roth, B. Brattoli, and B. Ommer (2020b) Sharing matters for generalization in deep metric learning. External Links: 2004.05582 Cited by: §1, §3.4.
  • I. Misra and L. van der Maaten (2019) Self-supervised learning of pretext-invariant representations. External Links: 1912.01991 Cited by: §2.
  • Y. Movshovitz-Attias, A. Toshev, T. K. Leung, S. Ioffe, and S. Singh (2017) No fuss distance metric learning using proxies. In Proceedings of the IEEE International Conference on Computer Vision, pp. 360–368. Cited by: §2.
  • K. Musgrave, S. Belongie, and S. Lim (2020) A metric learning reality check. External Links: 2003.08505 Cited by: §1.
  • H. Oh Song, Y. Xiang, S. Jegelka, and S. Savarese (2016) Deep metric learning via lifted structured feature embedding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4004–4012. Cited by: §A.1, §1, §2, §4.2, Table 2.
  • M. Opitz, G. Waltner, H. Possegger, and H. Bischof (2018) Deep metric learning with bier: boosting independent embeddings robustly. IEEE transactions on pattern analysis and machine intelligence. Cited by: §2.
  • W. Park, D. Kim, Y. Lu, and M. Cho (2019) Relational knowledge distillation. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). External Links: ISBN 9781728132938, Link, Document Cited by: §2.
  • A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in PyTorch. In NIPS-W, Cited by: §A.2, Appendix F.
  • Q. Qian, L. Shang, B. Sun, J. Hu, H. Li, and R. Jin (2019) SoftTriple loss: deep metric learning without triplet sampling. Cited by: §2, §4.1, Table 2.
  • A. Raghu, M. Raghu, S. Bengio, and O. Vinyals (2019) Rapid learning or feature reuse? towards understanding the effectiveness of maml. External Links: 1909.09157 Cited by: §1, §2, §3.3.
  • J. Rajasegaran, S. Khan, M. Hayat, F. S. Khan, and M. Shah (2020) Self-supervised knowledge distillation for few-shot learning. External Links: 2006.09785 Cited by: §1.
  • A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio (2014) FitNets: hints for thin deep nets. External Links: 1412.6550 Cited by: §2.
  • K. Roth, B. Brattoli, and B. Ommer (2019) MIC: mining interclass characteristics for improved metric learning. In Proceedings of the IEEE International Conference on Computer Vision, pp. 8000–8009. Cited by: §2, §4.1, §5.2, Table 2.
  • K. Roth, T. Milbich, and B. Ommer (2020) PADS: policy-adapted sampling for visual similarity learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2, §5.2, Table 2.
  • K. Roth, T. Milbich, S. Sinha, P. Gupta, B. Ommer, and J. P. Cohen (2020) Revisiting training strategies and generalization performance in deep metric learning. External Links: 2002.08473 Cited by: §A.2, §A.2, Appendix B, Appendix B, Appendix D, Appendix D, §1, §3.1, §4.1, §4.1, §4.2, Table 1, §5.1, §5.3.
  • A. Sanakoyeu, V. Tschernezki, U. Buchler, and B. Ommer (2019) Divide and conquer the embedding space for metric learning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2, §3.1, Table 2.
  • F. Schroff, D. Kalenichenko, and J. Philbin (2015) Facenet: a unified embedding for face recognition and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 815–823. Cited by: §1, §2, §3.1.
  • K. Sohn (2016) Improved deep metric learning with multi-class n-pair loss objective. In Advances in Neural Information Processing Systems, pp. 1857–1865. Cited by: §2.
  • J. L. Suárez, S. García, and F. Herrera (2018) A tutorial on distance metric learning: mathematical foundations, algorithms and experiments. External Links: 1812.05944 Cited by: §3.1.
  • Y. Tian, D. Krishnan, and P. Isola (2019) Contrastive representation distillation. External Links: 1910.10699 Cited by: §2, §3.2.
  • Y. Tian, Y. Wang, D. Krishnan, J. B. Tenenbaum, and P. Isola (2020) Rethinking few-shot image classification: a good embedding is all you need?. External Links: 2003.11539 Cited by: §1, §2.
  • C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie (2011) The caltech-ucsd birds-200-2011 dataset. Cited by: §A.1, §1, §4.2, Table 2.
  • T. Wang and P. Isola (2020) Understanding contrastive representation learning through alignment and uniformity on the hypersphere. External Links: 2005.10242 Cited by: §3.1.
  • X. Wang, X. Han, W. Huang, D. Dong, and M. R. Scott (2019) Multi-similarity loss with general pair weighting for deep metric learning. External Links: 1904.06627 Cited by: Appendix B, §2, §4.1, §4.1, Table 1, Table 2.
  • C. Wu, R. Manmatha, A. J. Smola, and P. Krahenbuhl (2017) Sampling matters in deep embedding learning. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2840–2848. Cited by: Appendix B, §1, §2, §3.1, §4.1, §4.1, Table 1.
  • H. Xuan, R. Souvenir, and R. Pless (2018) Deep randomized ensembles for metric learning. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 723–734. Cited by: §2.
  • H. Xuan, A. Stylianou, and R. Pless (2020) Improved embeddings with easy positive triplet mining. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Cited by: Table 2.
  • L. Yu, V. O. Yazici, X. Liu, J. van de Weijer, Y. Cheng, and A. Ramisa (2019) Learning metrics from teachers: compact networks for image embedding. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). External Links: ISBN 9781728132938, Link, Document Cited by: §2, §5.4.
  • S. Zagoruyko and N. Komodakis (2016) Paying more attention to attention: improving the performance of convolutional neural networks via attention transfer. External Links: 1612.03928 Cited by: §2.
  • A. Zhai and H. Wu (2018) Classification is a strong baseline for deep metric learning. External Links: 1811.12649 Cited by: §2, §4.1, Table 2.
  • W. Zheng, Z. Chen, J. Lu, and J. Zhou (2019) Hardness-aware deep metric learning. The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §1, §2.

Supplementary: Simultaneous Similarity-based Self-Distillation for Deep Metric Learning


Appendix A More Benchmark & Implementation Details

In this part, we report all relevant benchmark details omitted in the main document as well as further implementation details.

A.1 Benchmarks

CUB200-2011 (Wah et al., 2011) contains 200 bird classes over 11,788 images, where the first and last 100 classes with 5,864/5,924 images are used for training and testing, respectively.
CARS196 (Krause et al., 2013) contains 196 car classes and 16,185 images, where again the first and last 98 classes with 8,054/8,131 images are used to create the training/testing split.
Stanford Online Products (SOP) (Oh Song et al., 2016) is built around 22,634 product classes over 120,053 images and comes with a provided split: 11,318 classes with 59,551 images are used for training, and 11,316 classes with 60,502 images for testing.

A.2 Implementation

We now provide further details on the training and testing setup. For every study except the comparison against the state-of-the-art (Table 2), which uses different backbones and embedding dimensions, we follow the setup of Roth et al. (2020) (repository: github.com/Confusezius/Revisiting_Deep_Metric_Learning_PyTorch). This includes a ResNet50 (He et al., 2016) with frozen Batch-Normalization (Ioffe and Szegedy, 2015), normalization of the output embeddings, and optimization with Adam (Kingma and Ba, 2015); embedding dimensionality, learning rate and weight decay follow the defaults of that setup. For training, input images are randomly resized and cropped from the original image size, with additional random horizontal flipping. During testing, center crops are used. The batchsize likewise follows the reference setup.

Training runs span 150 epochs on CUB200-2011 and CARS196 and 100 epochs on SOP for all experiments, without any learning rate scheduling, except for the state-of-the-art experiments (see again Tab. 2). For the latter, we made use of slightly longer training to account for conservative learning rate scheduling, as is similarly done across the reference methods noted in Tab. 2. Schedule and decay values are determined on validation-subset performance. All baseline DML objectives we apply our self-distillation module S2SD to use the default parameters noted in Roth et al. (2020), with the single exception of Margin Loss on SOP, where we found adjusted class margins to be more beneficial for distillation than the default value; this change had no notable impact on the baseline performance. Finally, similar to Kim et al. (2020), we found a warmup epoch for all MLPs to improve convergence on SOP. Spectral decay computations in §5.3 follow the setting described in Supp. D.

We implement everything in PyTorch (Paszke et al., 2017). Experiments are run on GPU servers containing NVIDIA Titan X, P100 and T4 cards; memory usage never exceeds 12GB. Each result is averaged over five seeds, and for the sake of reproducibility and result validity we report mean and standard deviation, even though this is commonly neglected in the DML literature.

Appendix B Baseline Methods

This section provides a more detailed explanation of the DML baseline objectives we use alongside our self-distillation module S2SD in the experimental Section 4. For additional details, we refer to the supplementary material of Roth et al. (2020); for the mathematical notation (feature network, embedding function, and sample embeddings), we refer to Section 3.1. Alongside the method descriptions, we provide the hyperparameters used.

Margin Loss (Wu et al., 2017) builds on triplet/pair-based losses, but introduces both class-specific, learnable boundaries \beta_c between positive and negative pairs, as well as distance-based sampling for negatives:

\mathcal{L}_{\text{margin}} = \frac{1}{|\mathcal{P}|} \sum_{(i,j)\in\mathcal{P}} \left[ \gamma + \mathbb{1}_{y_i = y_j}\left( D(\psi_i, \psi_j) - \beta_{y_i} \right) - \mathbb{1}_{y_i \neq y_j}\left( D(\psi_i, \psi_j) - \beta_{y_i} \right) \right]_+

where \mathcal{P} denotes the available pairs in minibatch \mathcal{B}, D the euclidean distance between the normalized embeddings \psi, and \gamma a fixed margin. Throughout this work, we use the default class boundaries, except for S2SD on SOP, where we found adjusted boundaries to work better without changing the baseline performance. The class boundaries \beta_c are optimized with their own learning rate.
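For illustration, a minimal PyTorch sketch of a margin-style loss (with a single shared boundary in place of the class-specific learnable ones, and without distance-weighted sampling; hyperparameter values are placeholders):

```python
import torch
import torch.nn.functional as F

def margin_loss(emb, labels, beta=1.2, margin=0.2):
    """All-pair margin loss sketch: a shared boundary `beta` instead of the
    class-specific learnable boundaries, and no distance-based sampling.
    Values are illustrative placeholders, not the tuned paper settings."""
    emb = F.normalize(emb, dim=-1)
    dist = torch.cdist(emb, emb)
    same = labels[:, None] == labels[None, :]
    eye = torch.eye(len(emb), dtype=torch.bool)
    pos = (margin + dist - beta).clamp(min=0)[same & ~eye]  # pull positives inside beta
    neg = (margin - dist + beta).clamp(min=0)[~same]        # push negatives outside beta
    n_active = ((pos > 0).sum() + (neg > 0).sum()).clamp(min=1)
    return (pos.sum() + neg.sum()) / n_active

# a perfectly separated batch incurs zero loss
emb = torch.tensor([[1., 0.], [1., 0.], [0., 1.], [0., 1.]])
labels = torch.tensor([0, 0, 1, 1])
loss = margin_loss(emb, labels)
```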

Regularized Margin Loss (Roth et al., 2020) proposes a simple regularization scheme on top of the margin loss that increases the number of directions of significant variance in the embedding space by randomly exchanging a negative sample with a positive one with a small, fixed probability. For ResNet backbones, we use the exchange probabilities of Roth et al. (2020) for CUB200, CARS196 and SOP; for Inception-based backbones, we use separately tuned probabilities for CUB200/CARS196 and for SOP.

Multisimilarity Loss (Wang et al., 2019) incorporates more similarities into training by operating directly on all positive and negative samples for an anchor i, while also incorporating a sampling operation that encourages the usage of harder training samples:

\mathcal{L}_{\text{multisim}} = \frac{1}{b} \sum_{i=1}^{b} \left[ \frac{1}{\alpha} \log\left( 1 + \sum_{p \in \mathcal{P}_i} e^{-\alpha (s_{ip} - \lambda)} \right) + \frac{1}{\beta} \log\left( 1 + \sum_{n \in \mathcal{N}_i} e^{\beta (s_{in} - \lambda)} \right) \right]

where s denotes the cosine similarity instead of the euclidean distance, and \mathcal{P}_i and \mathcal{N}_i the sets of positives and negatives for anchor i in the minibatch, respectively. We use the default values for \alpha, \beta and \lambda noted in Roth et al. (2020).
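A hedged PyTorch sketch of this loss without the pair-mining step (hyperparameter values are placeholders, not necessarily the paper defaults):

```python
import torch
import torch.nn.functional as F

def multisimilarity_loss(emb, labels, alpha=2.0, beta=50.0, lam=1.0):
    """Multisimilarity loss on cosine similarities, without the mining step.
    alpha/beta/lam are illustrative placeholder values."""
    emb = F.normalize(emb, dim=-1)
    sim = emb @ emb.T
    same = labels[:, None] == labels[None, :]
    eye = torch.eye(len(emb), dtype=torch.bool)
    loss = 0.0
    for i in range(len(emb)):
        pos, neg = sim[i][same[i] & ~eye[i]], sim[i][~same[i]]
        # soft weighting of hard positives (low similarity) ...
        loss = loss + torch.log1p(torch.exp(-alpha * (pos - lam)).sum()) / alpha
        # ... and hard negatives (high similarity)
        loss = loss + torch.log1p(torch.exp(beta * (neg - lam)).sum()) / beta
    return loss / len(emb)
```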

Appendix C Evaluation Metrics

The evaluation metrics used throughout this work are Recall@1 (R@1), Recall@2 (R@2) and Normalized Mutual Information (NMI), capturing two distinct embedding space properties.

Recall@K, see e.g. Jegou et al. (2011), especially Recall@1 and Recall@2, is the primary metric used to compare the performance of DML methods, as it offers strong insight into the retrieval performance of the learned embedding spaces. Given the set of embedded samples \Psi with labels y_\psi, and the sorted set of the k nearest neighbours \mathcal{F}_k(\psi) (under euclidean distance, excluding \psi itself) for any sample \psi \in \Psi, Recall@K is measured as

\text{Recall@K} = \frac{1}{|\Psi|} \sum_{\psi \in \Psi} \mathbb{1}\left[ \exists\, \psi' \in \mathcal{F}_k(\psi) : y_{\psi'} = y_{\psi} \right]

which evaluates how likely semantically corresponding pairs (as determined here by the labelling y) occur in a neighbourhood of size k.
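Recall@K can be computed directly from the embeddings, e.g. with a small NumPy sketch (brute-force distances, with the query excluded from its own neighbourhood):

```python
import numpy as np

def recall_at_k(emb, labels, k=1):
    """Recall@K: fraction of samples whose k nearest neighbours (query
    excluded) contain at least one sample of the same class."""
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    dist = np.linalg.norm(emb[:, None] - emb[None, :], axis=-1)
    np.fill_diagonal(dist, np.inf)            # exclude the query itself
    nn_idx = np.argsort(dist, axis=1)[:, :k]  # k nearest neighbours per query
    hits = (labels[nn_idx] == labels[:, None]).any(axis=1)
    return hits.mean()
```

In practice, large-scale evaluation would use an approximate or GPU-accelerated index (e.g. faiss) rather than the quadratic distance matrix above.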

Normalized Mutual Information (NMI), see Manning et al. (2010), evaluates the clustering quality of the embedded samples. It is computed by first clustering the embeddings with a number of cluster centers, usually corresponding to the number of available classes, using a clustering method of choice such as K-Means (Lloyd, 1982). This assigns each sample a cluster label/id based on the nearest cluster centroid. With \omega_k the set of samples with cluster label k, \Omega = \{\omega_k\} the set of cluster sets, c_l the set of samples with true label l and \mathcal{C} = \{c_l\} the set of class label sets, the Normalized Mutual Information is given as

\text{NMI}(\Omega, \mathcal{C}) = \frac{2\, I(\Omega, \mathcal{C})}{H(\Omega) + H(\mathcal{C})}

with mutual information I(\cdot, \cdot) and entropy H(\cdot).
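The quantity can be computed from scratch as follows (a NumPy sketch; in practice a library implementation such as scikit-learn's would typically be used):

```python
import numpy as np

def nmi(cluster_ids, class_ids):
    """NMI(Omega, C) = 2 * I(Omega; C) / (H(Omega) + H(C))."""
    n = len(cluster_ids)
    def entropy(ids):
        _, counts = np.unique(ids, return_counts=True)
        p = counts / n
        return -(p * np.log(p)).sum()
    # joint counts over (cluster, class) pairs
    joint = {}
    for w, c in zip(cluster_ids, class_ids):
        joint[(w, c)] = joint.get((w, c), 0) + 1
    pw = {w: np.mean(cluster_ids == w) for w in np.unique(cluster_ids)}
    pc = {c: np.mean(class_ids == c) for c in np.unique(class_ids)}
    mi = sum(cnt / n * np.log((cnt / n) / (pw[w] * pc[c]))
             for (w, c), cnt in joint.items())
    return 2 * mi / (entropy(cluster_ids) + entropy(class_ids))
```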

Appendix D Generalization Metrics

Embedding Space Density. Given a set of embeddings \Psi with class labels, we first define the average inter-class distance as

\pi_{\text{inter}}(\Psi) = \frac{1}{Z_{\text{inter}}} \sum_{k \neq l} D(\mu_k, \mu_l)

which measures the average distance between groups of embeddings of classes k and l, estimated by the respective class centers \mu_k and \mu_l. Z_{\text{inter}} denotes a normalization constant based on the number of available classes. We also introduce the average intra-class distance as the mean distance between samples within their respective class,

\pi_{\text{intra}}(\Psi) = \frac{1}{Z_{\text{intra}}} \sum_{k} \sum_{\psi_i, \psi_j \in \Psi_k,\, i \neq j} D(\psi_i, \psi_j)

again with normalization constant Z_{\text{intra}} and \Psi_k the set of embeddings with class k. Given these two quantities, the embedding space density is then defined as

\pi_{\text{ratio}}(\Psi) = \frac{\pi_{\text{intra}}(\Psi)}{\pi_{\text{inter}}(\Psi)}

and effectively measures how densely samples and classes are grouped together. Roth et al. (2020) show that optimizing the DML problem while keeping the embedding space density high, i.e. without aggressive clustering, encourages better generalization to unseen test classes.

Spectral Decay. The spectral decay metric is defined as the KL-divergence between the (sorted, normalized) spectrum of singular values S_\Psi (obtained via Singular Value Decomposition (SVD) of the embedding matrix) and a d-dimensional uniform distribution U_d, and is inversely related to the entropy of the embedding space:

\rho(\Psi) = \text{KL}\left( U_d \,\|\, S_\Psi \right)

It does not account for class distributions. Roth et al. (2020) show that doing DML while encouraging a high-entropy feature space notably benefits the generalization performance. In our experiments, we disregard the first 10 singular values (out of 128) to highlight the feature diversity. This is important, as we evaluate the spectral decay within the same objectives, which results in the first few singular values being highly similar.
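A minimal NumPy sketch (the exact normalization of the spectrum is an assumption):

```python
import numpy as np

def spectral_decay(emb, skip=0):
    """KL(U_d || S): divergence between a uniform distribution and the sorted,
    normalized singular value spectrum of the embedding matrix; `skip` drops
    the first singular values, as done in the paper's evaluation."""
    s = np.linalg.svd(emb, compute_uv=False)[skip:]
    p = s / s.sum()                     # normalized spectrum
    u = np.ones_like(p) / len(p)        # uniform reference distribution
    return float((u * np.log(u / p)).sum())
```

An isotropic embedding matrix yields a decay of zero, while a spectrum dominated by a few directions yields a larger value.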

Appendix E Additional Experiments

This part extends the set of ablation experiments performed in Section 5.4 of the main paper.
a. Detaching target spaces for distillation. We examine whether it is preferable to detach the target embeddings from the distillation loss (see Eq. 3), as we want the reference embedding space to approximate the higher-dimensional relations. Conversely, we do not want the target embedding networks to reduce high-dimensional to lower-dimensional relations just to optimize for the distillation constraint. As can be seen in Fig. 4C, detaching the target embedding spaces is indeed notably beneficial for a stronger reference embedding, supporting the previous motivation.
b. Influence of varying target dimensions. As noted at the beginning of Section 4, we set the target dimension for dual self-distillation (DSD) to 2048, which we motivate through a small ablation study in Fig. 4A, with TD denoting the target dimension of choice. As can be seen, benefits plateau once the target dimension reaches more than four times the reference dimension. However, to be directly comparable to high-dimensional reference settings, we set 2048 as the default.
c. Ablating multiple distillation scales. Going further, we extend the module with additional embedding branches to the multiscale self-distillation approach (MSD), all operating in different, but higher-than-reference, dimensions. As already shown in Figure 3B of the main paper, multiscale distillation is beneficial by encouraging reusable sample relations. Here, we motivate the choice of four target branches (as noted in Sec. 4). Looking at Figure 4A, where #B denotes the number of additional target spaces, we see a benefit of multiple additional target spaces of ascending dimension. As the improvements saturate at four branches, we simply set this as the default value. However, the additional benefit of going from dual to multiscale distillation is not as large as going from no to dual target space distillation, showcasing the general benefit of high-dimensional concurrent self-distillation. Finally, we highlight that the multiscale approach slightly outperforms a multibranch distillation setup (Fig. 4A, Multi-B) in which each target branch has the same (highest) target dimension, while introducing fewer additional parameters.
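Such ascending-dimension auxiliary branches could be instantiated as follows (a sketch; the hidden width and the concrete target dimensions are illustrative assumptions, mirroring the "four branches, highest dimension 2048" setup):

```python
import torch
import torch.nn as nn

def make_aux_branches(feat_dim=2048, target_dims=(512, 1024, 1536, 2048)):
    """Four auxiliary two-layer MLP branches mapping the pooled penultimate
    features to ascending target dimensions. Hidden width and the exact
    dimension values are assumptions for illustration."""
    return nn.ModuleList([
        nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.ReLU(),
                      nn.Linear(feat_dim, d))
        for d in target_dims
    ])
```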
d. Finer-grained feature distillation. As already shown in Section 4 and again in Figure 4B, we see benefits of feature distillation using the (globally averaged) normalized penultimate feature space. It therefore makes sense to investigate the benefit of distilling even finer-grained feature representations. Defining #P as the pooling window size applied to the non-averaged penultimate feature representation, we investigate less compressed feature representation spaces. As can be seen in Fig. 4B, there appears to be no benefit to distilling feature representations from higher up the network.
e. Runtime comparison of base dimensionalities. We highlight relative retrieval times at different base dimensionalities in Tab. 3, using faiss (Johnson et al., 2017) on an NVIDIA 1080Ti and a synthetic set of 250,000 embeddings of varying dimensionality. With S2SD matching high-dimensional performance at much smaller base dimensionalities (see §5.3), retrieval runtime can be reduced by up to an order of magnitude.

Figure 4: Additional ablations. (A) Increasing target dimensions offers notable improvements. We opt for a target dimension of 2048 due to slightly higher mean improvements. For multiple embedding branches (#B), there seems to be an optimum at four branches. (B) Furthermore, feature distillation gives another notable boost. However, this only holds for the globally averaged penultimate feature representation. When distilling more fine-grained feature representations, performance degenerates (where #P denotes smaller pooling windows applied to the penultimate feature representation). (C) We show that detached auxiliary branches for distillation are crucial to higher improvements, as we want the reference embedding space to approximate the higher-dimensional one.
Dimensionality | 32 | 64 | 128 | 256 | 512 | 1024 | 2048
Runtime (s) | 1.54±0.00 | 1.98±0.00 | 2.71±0.00 | 4.35±0.00 | 7.38±0.01 | 13.83±0.02 | 27.21±0.17
Table 3: Sample retrieval times for 250,000 embeddings with varying base dimensionalities.

Appendix F Pseudo-Code

To facilitate reproducibility, we provide pseudo-code based on PyTorch (Paszke et al., 2017).

import torch, torch.nn as nn, torch.nn.functional as F
from torch.nn.functional import normalize as norm

# Relevant criterion attributes (set during initialization):
#   self.base_criterion: base DML objective
#   self.trgt_criteria: list of DML objectives for target spaces
#   self.trgt_nets: ModuleList of auxiliary embedding MLPs
#   self.dist_gamma: distillation weight
#   self.it_before_feat_distill: iterations before feature distillation
#   self.T: distillation temperature
#   self.it_count: iteration counter

def forward(self, batch, labels, pre_batch, **kwargs):
    """
    Args:
        batch: image embeddings, shape: bs x d
        labels: image labels, shape: bs
        pre_batch: penultimate network features, shape: bs x d* x h x w
    """
    bs, batch = len(batch), norm(batch, dim=-1)

    ### Compute ref. sample relations and loss on ref. embedding space
    base_smat = batch.mm(batch.T)
    base_loss = self.base_criterion(batch, labels, **kwargs)

    ### Global average pooling (and max pooling if wanted) of penultimate features
    avg_pre_batch  = nn.AdaptiveAvgPool2d(1)(pre_batch).view(bs, -1)
    avg_pre_batch += nn.AdaptiveMaxPool2d(1)(pre_batch).view(bs, -1)

    ### Compute MSD loss terms (target-space losses & distillations)
    dist_losses, trgt_losses = [], []
    for trgt_crit, trgt_net in zip(self.trgt_criteria, self.trgt_nets):
        trgt_batch     = norm(trgt_net(avg_pre_batch), dim=-1)
        trgt_loss      = trgt_crit(trgt_batch, labels, **kwargs)
        trgt_smat      = trgt_batch.mm(trgt_batch.T)
        base_trgt_dist = self.kl_div(base_smat, trgt_smat.detach())
        trgt_losses.append(trgt_loss)
        dist_losses.append(base_trgt_dist)

    ### MSD loss
    multi_dist_loss  = (base_loss + torch.stack(trgt_losses).mean()) / 2.
    multi_dist_loss += self.dist_gamma * torch.stack(dist_losses).mean()

    ### Distillation of penultimate features -> MSDFA
    src_feat_dist = 0
    if self.it_count >= self.it_before_feat_distill:
        n_avg_pre_batch = norm(avg_pre_batch, dim=-1).detach()
        avg_feat_smat   = n_avg_pre_batch.mm(n_avg_pre_batch.T)
        src_feat_dist   = self.kl_div(base_smat, avg_feat_smat)

    ### Total S2SD training objective
    total_loss = multi_dist_loss + self.dist_gamma * src_feat_dist
    self.it_count += 1
    return total_loss

def kl_div(self, A, B):
    # KL-divergence between row-wise softmax distributions of similarity matrices
    log_p_A = F.log_softmax(A / self.T, dim=-1)
    p_B     = F.softmax(B / self.T, dim=-1)
    return F.kl_div(log_p_A, p_B, reduction='sum') * (self.T ** 2) / A.size(0)
Listing 1: PyTorch implementation for S2SD.

Appendix G Detailed Evaluation Results

This table contains all method ablations for a fair evaluation as used in Section 5.2 and Table 1.

Benchmarks CUB200-2011 CARS196 SOP
Approaches R@1 NMI R@1 NMI R@1 NMI
Table 4: Detailed comparison of Recall@1 and NMI performances against well-performing DML objectives examined in Section 5.2. This is the complete version of Table 1. All results are computed over 5-run averages. (*) For Margin Loss on SOP, we found adjusted class boundaries to give better distillation results without notably influencing baseline performance.

Appendix H Detailed Ablation Results

Detailed values for the ablation experiments in Section 5.4 and Appendix E.

Experiment Setting R@1
Distillation Best Teacher (d=1024)
Base Student (d=128)
Distill Student (d=128)
Concur. Student (d=128)
Joint Student (d=128)
DSD (d=128)
DSDA (d=128)
MSDA (d=128)
MSDFA (d=128)
Table 5: Experiment: Comparison of concurrent self-distillation against standard 2-stage distillation. This table also shows that training without distillation (Joint) or training in high dimension while learning a detached low-dimensional embedding layer (Concur.) does not benefit performance notably. See fig. 3A. All results are computed over 5-run averages.
Experiment Setting R@1 Setting R@1
Embedding Dimensionality Base (d=16) MSD (d=256)
Basic (d=32) MSD (d=512)
Basic (d=64) MSD (d=1024)
Basic (d=128) MSD (d=2048)
Basic (d=256) MSDA (d=16)
Basic (d=512) MSDA (d=32)
Basic (d=1024) MSDA (d=64)
Basic (d=2048) MSDA (d=128)
DSD (d=16) MSDA (d=256)
DSD (d=32) MSDA (d=512)
DSD (d=64) MSDA (d=1024)
DSD (d=128) MSDA (d=2048)