(ICML 2021) Implementation for S2SD - Simultaneous Similarity-based Self-Distillation for Deep Metric Learning. Paper Link: https://arxiv.org/abs/2009.08348
Deep Metric Learning (DML) provides a crucial tool for visual similarity and zero-shot retrieval applications by learning generalizing embedding spaces, although recent work in DML has shown strong performance saturation across training objectives. However, generalization capacity is known to scale with the embedding space dimensionality. Unfortunately, high-dimensional embeddings also create higher retrieval cost for downstream applications. To remedy this, we propose S2SD - Simultaneous Similarity-based Self-Distillation. S2SD extends DML with knowledge distillation from auxiliary, high-dimensional embedding and feature spaces to leverage complementary context during training while retaining test-time cost and with negligible changes to the training time. Experiments and ablations across different objectives and standard benchmarks show S2SD offering notable improvements of up to 7% in Recall@1, while also setting a new state-of-the-art. Code available at https://github.com/MLforHealth/S2SD.
Deep Metric Learning (DML) aims to learn embedding space models in which a predefined distance metric reflects not only the semantic similarities between training samples, but also transfers to unseen classes. The generalization capabilities of these models are important for applications in image retrieval (Wu et al., 2017)et al., 2015), clustering (Bouchacourt et al., 2018) and representation learning (He et al., 2019). Still, transfer learning into unknown test distributions remains an open problem, with Roth et al. (2020) and Musgrave et al. (2020) revealing strong performance saturation across DML training objectives. However, Roth et al. (2020) also show that embedding space dimensionality can be a driver for generalization across objectives due to higher representation capacity. Indeed, this insight can be linked to recent work targeting other objective-independent improvements to DML via artificial samples (Zheng et al., 2019), higher feature distribution moments (Jacob et al., 2019) or orthogonal features (Milbich et al., 2020a), which have shown promising relative improvements over selected DML objectives. Unfortunately, these methods come at a cost, be it longer training times or limited applicability. Similarly, drawbacks can be found when naively increasing the operating (base) embedding dimensionality, which incurs increased cost for data retrieval at test time and is especially problematic on larger datasets. This limits realistically usable dimensionalities and leads to benchmarks being evaluated against fixed, predefined dimensionalities.
In this work, we propose Simultaneous Similarity-based Self-Distillation (S2SD) to show that complex higher-dimensional information can actually be effectively leveraged in DML without changing the base embedding space and test time cost, which we motivate from two key elements. Firstly, in DML, an additional embedding space can be spanned by a multilayer perceptron (MLP) operating over the feature representation shared with the base embedding space (see e.g. Milbich et al., 2020a). With a larger dimensionality, we can thus cheaply learn a secondary high-dimensional embedding space simultaneously, also denoted as the target embedding space. Relative to the large feature backbone, and with the batchsize capping the number of additional high-dimensional operations, only little additional training cost is introduced. While we cannot utilize the high-dimensional target embedding space at test time for the aforementioned reasons, we may utilize it to boost the performance of the base embedding space. Unfortunately, a simple connection of base and target embedding spaces through the shared feature backbone is insufficient for the base embedding space to benefit from the auxiliary, high-dimensional information. Thus, secondly, to efficiently leverage the high-dimensional context, we use insights from knowledge distillation (Hinton et al., 2015), where a small “student” model is trained to approximate a larger “teacher” model. However, while knowledge distillation can be found in DML (Chen et al., 2017b), few-shot learning (Tian et al., 2020) and self-supervised extensions thereof (Rajasegaran et al., 2020), the reliance on additional, commonly larger teacher networks or multiple training runs (Furlanello et al., 2018) introduces much higher training cost. Fortunately, we find that the target embedding space learned simultaneously at higher dimension can sufficiently serve as a “teacher” during training: through knowledge distillation of its sample similarities, the performance of the base embedding space can be improved notably. Such distillation intuitively encourages the lower-dimensional base embedding space to embed semantic similarities similarly to the more expressive target embedding space and thus incorporate dimensionality-related generalization benefits.
Furthermore, S2SD makes use of the low cost of spanning additional embedding spaces to introduce multiple target embedding spaces. Operating each of them at a higher, but varying dimensionality, joint distillation can then be used to enforce reusability in the distilled content, akin to feature reusability in meta-learning (Raghu et al., 2019), for additional generalization boosts. Finally, in DML, the base embedding space is spanned over a penultimate feature space of much higher dimensionality, which introduces a dimensionality-based bottleneck (Milbich et al., 2020b). By applying the distillation objective between feature and base embedding space in S2SD, we further encourage better feature usage in the base embedding space. This facilitates the approximation of high-dimensional context through the base embedding space for additional improvements in generalization.
The benefits to generalization are highlighted in performance boosts across three standard benchmarks, CUB200-2011 (Wah et al., 2011), CARS196 (Krause et al., 2013) and Stanford Online Products (Oh Song et al., 2016), where S2SD improves test-set recall@1 of already strong DML objectives by up to 7%, while also setting a new state-of-the-art. Improvements are even more significant in very low-dimensional base embedding spaces, making S2SD attractive for large-scale retrieval problems which can benefit from reduced embedding dimensionalities. Importantly, as S2SD is applied during the same DML training process on the same network backbone, no large teacher networks or additional training runs are required. Simple experiments even show that S2SD can outperform comparable 2-stage distillation at much lower cost.
In summary, our contributions can be described as:
1) We propose Simultaneous Similarity-based Self-Distillation (S2SD) for DML, using knowledge distillation of high-dimensional context without large additional teacher networks or training runs.
2) We motivate and evaluate this approach through detailed ablations and experiments, showing that the method is agnostic to choices in objectives, backbones, and datasets.
3) Across benchmarks, we achieve significant improvements over strong baseline objectives and state-of-the-art performance, with especially large boosts for very low-dimensional embedding spaces.
Deep Metric Learning (DML) has proven useful for zero-shot image/video retrieval & clustering (Schroff et al., 2015; Wu et al., 2017; Brattoli et al., 2020), face verification (Liu et al., 2017; Deng et al., 2018) and contrastive (self-supervised) representation learning (e.g. He et al. (2019); Chen et al. (2020); Misra and van der Maaten (2019)).
Approaches can be divided into 1) improved ranking losses, 2) tuple sampling methods and 3) extensions to the standard DML training approach.
1) Ranking losses place constraints on relations in image tuples ranging from pairs (e.g. Hadsell et al. (2006)) to triplets (Schroff et al., 2015) and more complex orderings (Chen et al., 2017a; Oh Song et al., 2016; Sohn, 2016; Wang et al., 2019).
2) The number of possible tuples scales exponentially with dataset size, leading to many tuple sampling approaches to ensure that meaningful tuples are presented during training. These tuple sampling methods can follow heuristics (Schroff et al., 2015; Wu et al., 2017), be of hierarchical nature (Ge, 2018) or be learned (Roth et al., 2020). Similarly, learnable proxies to replace tuple members (Movshovitz-Attias et al., 2017; Kim et al., 2020; Qian et al., 2019) can also remedy the sampling issue, which can be extended to tackle DML from a classification viewpoint (Zhai and Wu, 2018; Deng et al., 2018).
3) Finally, extensions to the basic training scheme can involve synthetic data (Lin et al., 2018; Zheng et al., 2019; Duan et al., 2018), complementary features (Roth et al., 2019; Milbich et al., 2020a), a division into subspaces (Sanakoyeu et al., 2019; Xuan et al., 2018; Kim et al., 2018; Opitz et al., 2018) or higher-order moments (Jacob et al., 2019).
Knowledge Distillation involves knowledge transfer from teacher to (usually smaller) student models, e.g. by matching network softmax outputs/logits (Buciluǎ et al., 2006; Hinton et al., 2015), (attention-weighted) feature maps (Romero et al., 2014; Zagoruyko and Komodakis, 2016), or latent representations (Ahn et al., 2019; Park et al., 2019; Tian et al., 2019; Laskar and Kannala, 2020). Importantly, Tian et al. (2019) show that under fair comparison, basic matching via Kullback-Leibler (KL) divergences as used in Hinton et al. (2015) performs very well, which we also find to be the case for S2SD. This is further supported in recent few-shot learning literature (Tian et al., 2020), wherein KL-distillation alongside self-distillation (Furlanello et al., 2018) in additional meta-training stages improves the feature representation strength important for generalization (Raghu et al., 2019).
We now introduce key elements for Simultaneous Similarity-based Self-Distillation (S2SD) to improve generalization of embedding spaces by utilizing higher dimensional context. We start with preliminary notation and fundamentals to Deep Metric Learning (§3.1). We then define the three key elements to S2SD: Firstly, the Dual Self-Distillation (DSD) objective, which uses KL-Distillation on a concurrently learned high-dimensional embedding space (§3.2) to introduce the high-dimensional context into a low-dimensional embedding space during training. We then extend this to Multiscale Self-Distillation (MSD) with distillation from several different high-dimensional embedding spaces to encourage reusability in the distilled context (§3.3). Finally, we shift to self-distillation from normalized feature representations to counter dimensionality bottlenecks (MSDF) (§3.4).
DML builds on generic Metric Learning, which aims to find a (parametrized) distance metric on the feature space over images that best satisfies ranking constraints usually defined over class labels. This holds also for DML. However, while Metric Learning relies on a fixed feature extraction method to obtain these features, DML introduces deep neural networks to concurrently learn a feature representation. Most such DML approaches aim to learn Mahalanobis distance metrics, which cover the parametrized family of inner product metrics (Suárez et al., 2018; Chen et al., 2019). These metrics, with some restrictions (Suárez et al., 2018), can be reformulated as a Euclidean distance between linearly projected features, with a learned linear projection from the feature dimensionality to the embedding dimensionality acting as the embedding function. Importantly, this redefines the motivation behind DML as learning image embeddings of a chosen dimensionality such that their Euclidean distance is connected to semantic similarities between images. This embedding-based formulation offers the significant advantage of being compatible with fast approximate similarity search methods (e.g. Johnson et al. (2017)), allowing for large-scale applications at test time.
In this work, we assume embeddings to be normalized to the unit hypersphere, which is commonly done (Wu et al., 2017; Sanakoyeu et al., 2019; Liu et al., 2017; Wang and Isola, 2020) for beneficial regularizing purposes (Wu et al., 2017; Wang and Isola, 2020).
For the remainder, embeddings hence always refer to their normalized version.
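As a worked illustration of the two statements above, the following uses assumed notation (features $\phi$, a projection matrix $W$, embeddings $\psi = W\phi$), since the original symbols are not reproduced in this text. A Mahalanobis metric with a positive semi-definite matrix $M = W^\top W$ is a Euclidean distance after linear projection, and on the unit hypersphere the Euclidean distance is a monotone function of the cosine similarity:

```latex
\begin{aligned}
d_M(\phi_1, \phi_2)
  &= \sqrt{(\phi_1 - \phi_2)^\top M \,(\phi_1 - \phi_2)}
   = \sqrt{(\phi_1 - \phi_2)^\top W^\top W \,(\phi_1 - \phi_2)}
   = \lVert W\phi_1 - W\phi_2 \rVert_2 , \\
\lVert \psi_1 - \psi_2 \rVert_2^2
  &= \lVert \psi_1 \rVert_2^2 + \lVert \psi_2 \rVert_2^2 - 2\,\psi_1^\top \psi_2
   = 2 - 2\,\psi_1^\top \psi_2
   \quad \text{for } \lVert \psi_1 \rVert_2 = \lVert \psi_2 \rVert_2 = 1 .
\end{aligned}
```

The second line is why ranking neighbours by Euclidean distance and ranking them by cosine similarity give the same order for normalized embeddings.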
Common approaches to learn such a representation space involve training surrogates on ranking constraints defined by class labels. Such approaches start from pair or triplet-based ranking objectives (Hadsell et al., 2006; Schroff et al., 2015). The latter is defined over the set of available triplets in a mini-batch, each consisting of an anchor, a positive sample of the same class and a negative sample of a different class, and penalizes triplets in which the anchor-positive distance is not smaller than the anchor-negative distance by at least a margin. This can be extended with more complex ranking constraints or tuple sampling methods. We refer to Supp. B and Roth et al. (2020) for further insights and detailed studies.
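The triplet objective described above can be sketched in a few lines. This is a minimal pure-Python illustration rather than the paper's PyTorch implementation; the function names, the use of the plain (non-squared) Euclidean distance, the averaging over triplets and the margin value of 0.2 are assumptions made for the example.

```python
import math


def l2_normalize(vector):
    """Project a vector onto the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def euclidean(a, b):
    """Euclidean distance between two embeddings."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(embeddings, triplets, margin=0.2):
    """Mean hinge loss over (anchor, positive, negative) index triplets.

    The margin default is a hypothetical placeholder, not a value
    taken from the paper.
    """
    total = 0.0
    for anchor, positive, negative in triplets:
        d_ap = euclidean(embeddings[anchor], embeddings[positive])
        d_an = euclidean(embeddings[anchor], embeddings[negative])
        # A triplet only contributes if the negative is not at least
        # `margin` further away from the anchor than the positive.
        total += max(0.0, d_ap - d_an + margin)
    return total / len(triplets)


if __name__ == "__main__":
    emb = [l2_normalize(v) for v in ([1.0, 0.0], [1.0, 0.1], [0.0, 1.0])]
    print(triplet_loss(emb, [(0, 1, 2)]))  # satisfied triplet
    print(triplet_loss(emb, [(0, 2, 1)]))  # violated triplet
```

A satisfied triplet contributes nothing, so the objective is driven entirely by triplets that violate the margin, which is why tuple sampling matters in practice.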
For the aforementioned standard DML setting, generalization performance of a learned embedding space can be linked to the utilized embedding dimensionality. However, high dimensionality results in notably higher retrieval cost on downstream applications, which limits realistically usable dimensions.
In S2SD, we show that high-dimensional context can be used as a teacher during the training run of the low-dimensional base or reference embedding space. As the base embedding model is also the one that is evaluated, test time retrieval costs are left unchanged.
To achieve this, we simultaneously train an additional high-dimensional auxiliary/target embedding space spanned by a secondary embedding branch. This branch is parametrized by an MLP or a linear projection, similar to the branch spanning the base embedding space (see §3.1). Both branches operate on the same large, shared feature backbone.
For simplicity, we train both branches using the same DML objective.
Unfortunately, the higher expressivity and improved generalization of high-dimensional embeddings in the target space hardly benefit the base embedding space, even with a shared feature backbone. To explicitly leverage high-dimensional context for our base embedding space, we utilize knowledge distillation from target to base space. However, while common knowledge distillation approaches match single embeddings or features between student and teacher, the different dimensionalities of base and target space inhibit naive matching. Instead, S2SD matches sample relations (see e.g. Tian et al. (2019)) defined over batch-similarity matrices in base and target space, each containing one similarity per pair of samples in the batch. We thus encourage the base embedding space to relate different samples in a similar manner to the target space. To compute these similarity matrices, we use cosine similarity by default, which reduces to an inner product since embeddings are normalized to the unit hypersphere. Using the softmax operation and the Kullback-Leibler divergence, we define the simultaneous self-distillation objective as the KL divergence between the temperature-scaled, softmax-normalized similarities of the base and the target space, as visualized in Figure 1. No gradient flows to the target branches, as we only want the base space to learn from the target space. By default, we match rows or columns of the two similarity matrices, effectively distilling the relation of an anchor embedding to all other batch samples. Embedding all batch samples in the base dimension and in the higher dimension, the (simultaneous) Dual Self-Distillation (DSD) training objective then combines the DML objective on both embedding spaces with this distillation term.
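The row-wise similarity distillation described above can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions, not the released implementation: the function names are hypothetical, the divergence is taken from the target (teacher) distribution to the base (student) distribution, the loss is averaged over rows, and the temperature default of 1.0 is a placeholder rather than the paper's setting.

```python
import math


def similarity_matrix(embeddings):
    """Pairwise inner products; equals cosine similarity for unit-norm rows."""
    return [[sum(x * y for x, y in zip(a, b)) for b in embeddings]
            for a in embeddings]


def softmax(row, temperature):
    """Numerically stable softmax over one row of similarities."""
    scaled = [value / temperature for value in row]
    shift = max(scaled)
    exps = [math.exp(value - shift) for value in scaled]
    normalizer = sum(exps)
    return [e / normalizer for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def distillation_loss(base_sim, target_sim, temperature=1.0):
    """Average row-wise KL between target and base similarity distributions.

    In an autograd framework the target rows would be detached so that
    only the base branch receives gradients (assumption for illustration).
    """
    total = 0.0
    for base_row, target_row in zip(base_sim, target_sim):
        p_target = softmax(target_row, temperature)  # teacher, held fixed
        q_base = softmax(base_row, temperature)      # student
        total += kl_divergence(p_target, q_base)
    return total / len(base_sim)


if __name__ == "__main__":
    base = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]                 # 2-dim, unit norm
    target = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # 3-dim, unit norm
    print(distillation_loss(similarity_matrix(base), similarity_matrix(target)))
```

Because only the batch-by-batch similarity matrices are compared, the base and target embeddings may have different dimensionalities, which is the point of matching sample relations instead of embeddings.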
While DSD encourages the reference embedding space to recover complex sample relations by distilling from a higher-dimensional target space, it is not known a priori which distillable sample relations actually benefit generalization of the reference space.
To encourage the usage of sample relations that more likely aid generalization, we follow insights made in Raghu et al. (2019) on the connection between reusability of features across multiple tasks and better generalization thereof. We motivate reusability in S2SD by extending DSD to Multiscale Self-Distillation (MSD), with distillation instead from multiple different target spaces, each spanned by its own branch. Importantly, each of these high-dimensional target spaces operates on a different dimensionality, each higher than that of the base space. As this results in each target embedding space encoding sample relations differently, application of distillation across all target spaces pushes the base branch towards learning from sample relations that are reusable across all higher-dimensional embedding spaces and thereby more likely to generalize (see also Fig. 1).
Specifically, given the set of target similarity matrices and target batch embeddings, the MSD training objective combines the DML objective on the base space and on every target space with the distillation terms from each target space to the base space.
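How the individual terms could be combined is sketched below on already-computed scalar losses. The weighting is an assumption made for illustration: the base and averaged target DML terms are weighted equally, the distillation terms are averaged over targets and scaled by a weight called `gamma` here. None of these names or weights are confirmed by the text above.

```python
def msd_objective(dml_base, dml_targets, distillation_terms, gamma):
    """Combine per-branch DML losses with many-to-one distillation terms.

    dml_base:            DML loss of the base embedding space
    dml_targets:         DML losses of the target embedding spaces
    distillation_terms:  one distillation loss per target space
    gamma:               distillation weight (hypothetical name)
    """
    if len(dml_targets) != len(distillation_terms):
        raise ValueError("expected one distillation term per target space")
    mean_target_dml = sum(dml_targets) / len(dml_targets)
    mean_distillation = sum(distillation_terms) / len(distillation_terms)
    # Equal weighting of base and target DML terms is an assumption.
    return 0.5 * (dml_base + mean_target_dml) + gamma * mean_distillation


if __name__ == "__main__":
    # Two target spaces; with a single target this reduces to the dual case.
    print(msd_objective(1.0, [2.0, 4.0], [0.1, 0.3], gamma=10.0))
```

With a single target space the same function gives the dual (DSD) form, so the multiscale variant only changes how many teachers contribute to the averaged terms.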
As noted in §3.1, the base embedding space utilizes a linear projection from the (penultimate) feature space, whose dimensionality is commonly much larger than that of the embedding space. While compressed semantic spaces encourage stronger representations (Alemi et al., 2016; Dai and Wipf, 2019) to be learned, Milbich et al. (2020b) show that the actual test performance of the lower-dimensional embedding space is inferior to that of the non-adapted, but higher-dimensional feature space. This supports a dimensionality-based loss of information beneficial to generalization, which can hinder the base embedding space from optimally approximating the high-dimensional context introduced in §3.2 and §3.3.
To rectify this, we apply self-distillation (eq. 3) on the normalized feature representations generated by the backbone as well. Adding the distillation term computed from the batch of normalized feature representations to the MSD objective, we get multiscale self-distillation with feature distillation (MSDF) (see also Fig. 1).
In the same manner, one can also address other architectural information bottlenecks such as through the generation of feature representations from a single global pooling operation. While not noted in the original publication, Kim et al. (2020) address this in the official code release by using both global max- and average pooling to create their base embedding space. While this naive usage changes the architecture at test time, in S2SD we can fairly leverage potential benefits by only spanning the auxiliary spaces (and distilling) from such feature representations (denoted as DSDA/MSDA/MSDFA).
We study S2SD in four experiments to establish 1) method ablation performance & relative improvements, 2) state-of-the-art, 3) comparisons to standard 2-stage distillation, benefits to low-dimensional embedding spaces & generalization properties and 4) motivation for architectural choices.
Method Notation. We abbreviate ablations of S2SD (see §3) in our experiments as: DSD & MSD for Dual (3.2) & Multiscale Self-Distillation (3.3), MSDF the addition of Feature distillation (3.4) and DSDA/MSD(F)A the inclusion of multiple pooling operations in the auxiliary branches (also §3.4).
Fair Evaluation of Ablations. §5.1 specifically applies S2SD and its ablations to three DML baselines. To show realistic benefit, S2SD is applied to the best-performing objectives evaluated in Roth et al. (2020), namely (i) Margin loss with Distance-based Sampling (Wu et al., 2017), (ii) their proposed Regularized Margin loss and (iii) Multisimilarity loss (Wang et al., 2019), following their experimental training pipeline. This setup utilizes no learning rate scheduling and fixes common implementational factors of variation in DML pipelines such as batchsize, base embedding dimension, weight decay or feature backbone architectures to ensure comparability in DML (more details in Supp. A.2). As such, our results are directly comparable to their large set of examined methods, and relative improvements are guaranteed to stem solely from the application of S2SD.
Evaluation Across Architectures and Embedding Dimensions. §5.2 further highlights the benefits of S2SD by comparing S2SD’s boosting properties across literature standards, with different backbone architectures and base embedding dimensions: (1) ResNet50 with embedding dimensionality 128 (Wu et al., 2017; Roth et al., 2019) and (2) 512 (Zhai and Wu, 2018), as well as (3) variants of Inception-V1 with Batch-Normalization at embedding dimensionality 512 (Wang et al., 2019; Qian et al., 2019; Milbich et al., 2020a). Only here do we conservatively apply learning rate scheduling, since all references noted in Table 2 employ scheduling as well. We categorize published work based on backbone architecture and embedding dimension for fairer comparison. Note that this is a less robust comparison than done in §5.1, due to potential implementation differences between our pipeline and reported literature results.
Comparison to 2-Stage Distillation and Generalization Study. §5.3 compares S2SD to 2-stage distillation, investigates benefits to very low dimensional reference spaces and examines the connection between improvements and increased embedding space feature richness, measured by density and spectral decay (see Supp. D), which are linked to improved generalization in Roth et al. (2020).
Datasets & Evaluation. In all experiments, we evaluate on standard DML benchmarks: CUB200-2011 (Wah et al., 2011), CARS196 (Krause et al., 2013) and Stanford Online Products (SOP) (Oh Song et al., 2016). Performance is measured in recall at 1 (R@1) and at 2 (R@2) (Jegou et al., 2011) as well as Normalized Mutual Information (NMI) (Manning et al., 2010). More details in Supp. A & C.
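For readers unfamiliar with the retrieval metric, recall at k can be sketched as below. This is a small pure-Python illustration, not the evaluation code used for the reported numbers; the function name, the brute-force neighbour search and the squared Euclidean ranking are assumptions for the example.

```python
def recall_at_k(embeddings, labels, k=1):
    """Fraction of queries with at least one same-class sample among
    their k nearest neighbours (the query itself is excluded)."""
    hits = 0
    for i, query in enumerate(embeddings):
        candidates = [j for j in range(len(embeddings)) if j != i]
        # Rank all other samples by squared Euclidean distance to the query.
        candidates.sort(
            key=lambda j: sum((a - b) ** 2 for a, b in zip(query, embeddings[j]))
        )
        if any(labels[j] == labels[i] for j in candidates[:k]):
            hits += 1
    return hits / len(embeddings)


if __name__ == "__main__":
    points = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [0.2, 0.1]]
    classes = [0, 0, 1, 1, 1]  # the last sample sits inside the wrong cluster
    for k in (1, 2, 3):
        print(k, recall_at_k(points, classes, k))
```

In large-scale settings the exhaustive search would be replaced by an approximate nearest-neighbour index, which is where the cost of the embedding dimensionality becomes visible.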
Experimental Details. Our implementation follows Roth et al. (2020), with more details in Supp. A. For §5.1-5.4, we only adjust the respective pipeline elements in question. For S2SD, unless noted otherwise (such as in §5.4), the distillation weight is set once for all objectives on CUB200 and CARS196 and separately for SOP. DSD uses a single target dimensionality and MSD a set of target dimensionalities. We found it beneficial to activate the feature distillation only after a dataset-specific number of iterations for CUB200, CARS196 and SOP, respectively, to ensure that meaningful features are learned first before feature distillation is applied. The additional embedding spaces are generated by two-layer MLPs with row-wise KL-distillation of similarities (eq. 3), applied as in (eq. 5). By default, we use Multisimilarity Loss as the DML objective.
Table 1: S2SD and its ablations applied to Margin (Wu et al., 2017), R-Margin (Roth et al., 2020) and Multisimilarity (Wang et al., 2019).
In Tab. 1 (full table in Supp. Tab. 4), we show that under the fair experimental protocol used in Roth et al. (2020), utilizing S2SD and its ablations gives an objective- and benchmark-independent, significant boost in performance, opposing the existing DML objective performance plateau. This holds even for regularized objectives such as R-Margin loss, highlighting the effectiveness of S2SD for DML. Across objectives, S2SD-based changes in wall-time remain negligible.
| Benchmarks | CUB200 (Wah et al., 2011) | | | CARS196 (Krause et al., 2013) | | | SOP (Oh Song et al., 2016) | | |
|---|---|---|---|---|---|---|---|---|---|
| Div&Conq (Sanakoyeu et al., 2019) | 65.9 | 76.6 | 69.6 | 84.6 | 90.7 | 70.3 | 75.9 | 88.4 | 90.2 |
| MIC (Roth et al., 2019) | 66.1 | 76.8 | 69.7 | 82.6 | 89.1 | 68.4 | 77.2 | 89.4 | 90.0 |
| PADS (Roth et al., 2020) | 67.3 | 78.0 | 69.9 | 83.5 | 89.7 | 68.8 | 76.5 | 89.0 | 89.9 |
| Multisimilarity+S2SD | 68.0 ± 0.2 | 78.7 ± 0.1 | 71.7 ± 0.4 | 86.3 ± 0.1 | 91.8 ± 0.3 | 72.0 ± 0.3 | 79.0 ± 0.2 | 90.2 ± 0.1 | 90.6 ± 0.1 |
| Margin+S2SD | 67.6 ± 0.3 | 78.2 ± 0.2 | 70.8 ± 0.3 | 86.0 ± 0.2 | 91.8 ± 0.2 | 72.2 ± 0.2 | 80.2 ± 0.2 | 91.5 ± 0.1 | 90.9 ± 0.1 |
| R-Margin+S2SD | 68.9 ± 0.3 | 79.0 ± 0.3 | 72.1 ± 0.4 | 87.6 ± 0.2 | 92.7 ± 0.2 | 72.3 ± 0.2 | 79.2 ± 0.2 | 90.3 ± 0.1 | 90.8 ± 0.1 |
| EPSHN (Xuan et al., 2020) | 64.9 | 75.3 | - | 82.7 | 89.3 | - | 78.3 | 90.7 | - |
| NormSoft (Zhai and Wu, 2018) | 61.3 | 73.9 | - | 84.2 | 90.4 | - | 78.2 | 90.6 | - |
| DiVA (Milbich et al., 2020a) | 69.2 | 79.3 | 71.4 | 87.6 | 92.9 | 72.2 | 79.6 | 91.2 | 90.6 |
| Multisimilarity+S2SD | 69.2 ± 0.1 | 79.1 ± 0.2 | 71.4 ± 0.2 | 89.2 ± 0.2 | 93.8 ± 0.2 | 74.0 ± 0.2 | 80.8 ± 0.2 | 92.2 ± 0.2 | 90.5 ± 0.3 |
| Margin+S2SD | 68.8 ± 0.2 | 78.5 ± 0.2 | 72.3 ± 0.1 | 89.3 ± 0.2 | 93.8 ± 0.2 | 73.7 ± 0.3 | 81.0 ± 0.2 | 92.1 ± 0.2 | 91.1 ± 0.3 |
| R-Margin+S2SD | 70.1 ± 0.2 | 79.7 ± 0.2 | 71.6 ± 0.2 | 89.5 ± 0.2 | 93.9 ± 0.3 | 72.9 ± 0.3 | 80.0 ± 0.2 | 91.4 ± 0.2 | 90.8 ± 0.1 |
| DiVA (Milbich et al., 2020a) | 66.8 | 77.7 | 70.0 | 84.1 | 90.7 | 68.7 | 78.1 | 90.6 | 90.4 |
| Multisimilarity+S2SD | 66.7 ± 0.3 | 77.5 ± 0.3 | 70.5 ± 0.2 | 83.8 ± 0.3 | 90.3 ± 0.2 | 69.8 ± 0.3 | 78.5 ± 0.2 | 90.6 ± 0.2 | 90.6 ± 0.1 |
| Margin+S2SD | 66.8 ± 0.2 | 77.9 ± 0.2 | 69.9 ± 0.3 | 84.3 ± 0.2 | 90.7 ± 0.2 | 69.8 ± 0.2 | 78.4 ± 0.2 | 90.5 ± 0.2 | 90.4 ± 0.1 |
| R-Margin+S2SD | 67.4 ± 0.3 | 78.0 ± 0.4 | 70.3 ± 0.2 | 83.9 ± 0.3 | 90.3 ± 0.2 | 69.4 ± 0.2 | 78.1 ± 0.2 | 90.4 ± 0.3 | 90.3 ± 0.2 |
| Softtriple (Qian et al., 2019) | 65.4 | 76.4 | 69.3 | 84.5 | 90.7 | 70.1 | 78.3 | 90.3 | 92.0 |
| Multisimilarity (Wang et al., 2019) | 65.7 | 77.0 | - | 84.1 | 90.4 | - | 78.2 | 90.5 | - |
| Multisimilarity+S2SD | 68.2 ± 0.3 | 79.1 ± 0.2 | 71.6 ± 0.2 | 86.3 ± 0.2 | 92.2 ± 0.2 | 72.0 ± 0.3 | 78.9 ± 0.2 | 90.8 ± 0.2 | 90.6 ± 0.1 |
| Margin+S2SD | 68.3 ± 0.2 | 78.8 ± 0.2 | 71.2 ± 0.2 | 87.1 ± 0.2 | 92.4 ± 0.1 | 72.2 ± 0.2 | 79.1 ± 0.2 | 91.0 ± 0.3 | 90.4 ± 0.1 |
| R-Margin+S2SD | 69.6 ± 0.3 | 79.6 ± 0.3 | 71.2 ± 0.1 | 86.6 ± 0.3 | 92.1 ± 0.3 | 70.9 ± 0.2 | 78.5 ± 0.1 | 90.5 ± 0.2 | 90.0 ± 0.2 |
Table 2: S2SD can boost baseline objectives to reach and even surpass SOTA, in parts by a notable margin, even when reported with confidence intervals, which is commonly neglected in DML. S2SD outperforms much more complex methods with feature mining or RL policies such as MIC (Roth et al., 2019), DiVA (Milbich et al., 2020a) or PADS (Roth et al., 2020).
Comparison to standard distillation. With a student S (same objective and embedding dimensionality as the reference branch in DSD) and a teacher T at the highest optimal dimensionality, we find that separating DSD into standard 2-stage distillation degrades performance (see Fig. 3A, compare to Dist.). S2SD also allows for easy integration of teacher ensembles, realized by MSD(F,A), to even outperform the teacher notably while operating on the embedding dimensionality of the student.
Benefits to lower base dimensions. We show that our module is able to vastly boost networks limited to very low embedding dimensions (cf. Fig. 3B). For example, networks trained with S2SD can match the performance of embedding dimensions four or eight times the size. At some of these low dimensionalities, S2SD even notably outperforms the highest-dimensional baseline.
Embedding space metrics. We now look at relative changes in embedding space density and spectral decay as in Roth et al. (2020), although we investigate changes within the same objectives. Fig. 2 shows S2SD increasing embedding space density and lowering the spectral decay (hence providing a more feature-diverse embedding space) across criteria.
Is distillation in S2SD important? Fig. 3A (Joint) and Fig. 3F highlight how crucial self-distillation is, as using a secondary embedding space without distillation hardly improves performance. Fig. 3A (Concur.) shows that joint training of a detached reference embedding while otherwise training in high dimension also does not offer notable improvement. Finally, Figure 3F shows robustness to changes in the distillation weight, with dataset-specific peaks for CUB200/CARS196 and for SOP. We also found a best-performing range of temperatures, from which the default temperature is chosen.
Best way to enforce reusability. To motivate our many-to-one self-distillation (eq. 5), we evaluate against other distillation setups that could support reusability of distilled sample relations: (1) Nested distillation, where instead of distilling all target spaces only to the reference space, we distill from a target space to all lower-dimensional embedding spaces, with the base embedding space counted among them. (2) Chained distillation, which distills from a target space only to the lower-dimensional embedding space closest in dimensionality.
Figure 3E shows that while either distillation method provides strong benefits, many-to-one distillation performs notably better, supporting the reusability aspect and motivating it as our default method.
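The three setups differ only in which (teacher, student) pairs receive a distillation term. The sketch below enumerates these pairs by embedding dimensionality; the function names and the example dimensionalities are hypothetical and chosen purely for illustration.

```python
def many_to_one(base_dim, target_dims):
    """Every target space distills only into the base space."""
    return [(target, base_dim) for target in sorted(target_dims)]


def nested(base_dim, target_dims):
    """Every target space distills into all lower-dimensional spaces,
    including the base space."""
    dims = sorted([base_dim] + list(target_dims))
    return [(high, low) for i, high in enumerate(dims) for low in dims[:i]]


def chained(base_dim, target_dims):
    """Every target space distills only into the next lower-dimensional space."""
    dims = sorted([base_dim] + list(target_dims))
    return [(dims[i + 1], dims[i]) for i in range(len(dims) - 1)]


if __name__ == "__main__":
    base, targets = 128, [512, 1024]  # hypothetical dimensionalities
    print("many-to-one:", many_to_one(base, targets))
    print("nested:     ", nested(base, targets))
    print("chained:    ", chained(base, targets))
```

Only in the many-to-one setup does the base space receive a signal from every teacher directly, which matches the reusability argument made above.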
Choice of distillation method & branch structures. Fig. 3C evaluates various distillation objectives, finding KL-divergence between vectors of similarities to perform better than KL-divergence applied over full similarity matrices or row-wise means thereof, as well as cosine/Euclidean distance-based distillation (see e.g. Yu et al., 2019). Figure 3D shows insights into optimal auxiliary branch structures, with two-layer MLPs giving the largest benefit, although even a linear target mapping reliably boosts performance. This coincides with insights made by Chen et al. (2020). Further network depth only deteriorates performance.
In this paper, we propose a novel knowledge-distillation based DML training paradigm, Simultaneous Similarity-based Self-Distillation (S2SD), to utilize high-dimensional context for improved generalization. S2SD solves the standard DML objective simultaneously in higher-dimensional embedding spaces while applying knowledge distillation concurrently between these high-dimensional teacher spaces and a lower-dimensional reference space. S2SD introduces little additional computational overhead, with no extra cost at test time. Thorough ablations and experiments show S2SD significantly improving the generalization performance of existing DML objectives regardless of embedding dimensionality, while also setting a new state-of-the-art on standard benchmarks.
We would like to thank Samarth Sinha (University of Toronto, Vector), Matthew McDermott (MIT) and Mengye Ren (University of Toronto, Vector) for insightful discussions and feedback on the paper draft. This work was funded in part by a CIFAR AI Chair at the Vector Institute, Microsoft Research, and an NSERC Discovery Grant. Resources used in preparing this research were provided, in part, by the Province of Ontario, the Government of Canada through CIFAR, and companies sponsoring the Vector Institute (www.vectorinstitute.ai/#partners).
Bouchacourt et al. (2018). Multi-level variational autoencoder: learning disentangled representations from grouped observations. In AAAI 2018.
Misra and van der Maaten (2019). Self-supervised learning of pretext-invariant representations.
Paszke et al. (2017). Automatic differentiation in PyTorch. In NIPS-W.
Zagoruyko and Komodakis (2016). Paying more attention to attention: improving the performance of convolutional neural networks via attention transfer.
In this part, we report all relevant benchmark details omitted in the main document as well as further implementation details.
CUB200-2011 (Wah et al., 2011) contains 200 bird classes over 11,788 images, where the first and last 100 classes with 5864/5924 images are used for training and testing, respectively.
CARS196 (Krause et al., 2013) contains 196 car classes and 16,185 images, where again the first and last 98 classes with 8054/8131 images are used to create the training/testing split.
Stanford Online Products (SOP) (Oh Song et al., 2016) is built around 22,634 product classes over 120,053 images and contains a provided split: 11,318 selected classes with 59,551 images are used for training, and 11,316 classes with 60,502 images for testing.
We now provide further details regarding the training and testing setup utilized. For any study except the comparison against the state-of-the-art (Table 2), which uses different backbones and embedding dimensions, we follow the setup used by Roth et al. (2020) (repository: github.com/Confusezius/Revisiting_Deep_Metric_Learning_PyTorch). This includes a ResNet50 (He et al., 2016) with frozen Batch-Normalization (Ioffe and Szegedy, 2015), normalization of the output embeddings, and optimization with Adam (Kingma and Ba, 2015) using a fixed learning rate and weight decay. The input images are randomly resized and cropped from the original image size for training, with further augmentation by random horizontal flipping. During testing, center crops are used. The batchsize is kept fixed.
Training runs on CUB200-2011 and CARS196 are done over 150 epochs, and over 100 epochs for SOP, for all experiments without any learning rate scheduling, except for the state-of-the-art experiments (see again Table 2). For the latter, we made use of slightly longer training to account for conservative learning rate scheduling, which is similarly done across reference methods noted in Tab. 2. Schedule and decay values are determined over validation subset performances. All baseline DML objectives we apply our self-distillation module S2SD on use the default parameters noted in Roth et al. (2020), with the single exception of Margin Loss on SOP, where we found a different class margin to be more beneficial for distillation than the default. This was done as the change had no notable impact on the baseline performance. Finally, similar to Kim et al. (2020), we found a warmup epoch of all MLPs to improve convergence on SOP. Spectral decay computations in §5.3 follow the setting described in Supp. D.
We implement everything in PyTorch (Paszke et al., 2017). Experiments are run on GPU servers containing Nvidia Titan X, P100 and T4 GPUs; however, memory usage never exceeds 12GB. Each result is averaged over five seeds, and for the sake of reproducibility and result validity, we report mean and standard deviation, even though this is commonly neglected in the DML literature.
This section provides a more detailed explanation of the DML baseline objectives we used alongside our self-distillation module S2SD in the experimental Section 4. For additional details, we refer to the supplementary material in Roth et al. (2020). For the mathematical notation, we refer to Section 3.1. We use to denote the feature network with embedding , and the embedding of a sample . Finally, alongside the method descriptions, we provide the hyperparameters used.
Margin Loss (Wu et al., 2017) builds on triplet/pair-based losses, but introduces both class-specific, learnable boundaries (with number of classes ) between positive and negative pairs, as well as distance-based sampling for negatives:
where denotes the available pairs in minibatch , and the embedding dimension. Throughout this work, we use except for S2SD on SOP, where we found to work better without changing the baseline performance. We set the learning rate for the class boundaries as , and margin .
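As a rough illustration of the objective described above, the following NumPy sketch evaluates a margin-style loss on a given list of pairs. It assumes the hinge form commonly associated with Wu et al. (2017), in which a positive pair is penalized when its distance exceeds the class boundary minus the margin, and a negative pair when its distance falls below the boundary plus the margin. The function name `margin_loss` and the values passed for `beta` and `gamma` are illustrative placeholders rather than the settings used in this work, and distance-based negative sampling is omitted.

```python
import numpy as np

def margin_loss(embeddings, labels, pairs, beta, gamma=0.2):
    """Margin-style loss averaged over (anchor, other) index pairs.

    beta:  one boundary per class, indexed by the anchor's label
           (learnable in practice; fixed here for illustration).
    gamma: margin around the boundary (placeholder value).
    """
    embeddings = np.asarray(embeddings, dtype=float)
    beta = np.asarray(beta, dtype=float)
    total = 0.0
    for anchor, other in pairs:
        dist = np.linalg.norm(embeddings[anchor] - embeddings[other])
        boundary = beta[labels[anchor]]
        if labels[anchor] == labels[other]:
            # Positive pair: penalize distances above (boundary - gamma).
            total += max(0.0, dist - boundary + gamma)
        else:
            # Negative pair: penalize distances below (boundary + gamma).
            total += max(0.0, boundary - dist + gamma)
    return total / max(len(pairs), 1)
```

In a trainable setting the class boundaries would be parameters with their own learning rate, as described in the text; here they are passed in as a fixed array.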
Regularized Margin Loss (Roth et al., 2020) proposes a simple regularization scheme on the margin loss that increases the number of directions of significant variance in the embedding space by randomly exchanging a negative sample with a positive one with probability . For ResNet-backbones, we use for CUB200, for CARS196 and for SOP as done in Roth et al. (2020). For Inception-based backbones, we set for CUB200 and CARS196 and for SOP.
Multisimilarity Loss (Wang et al., 2019) incorporates more similarities into training by operating directly on all positive and negative samples for an anchor , while also incorporating a sampling operation that encourages the usage of harder training samples:
where denotes the cosine similarity instead of the euclidean distance, and the set of positives and negatives for in the minibatch, respectively. We use the default values , , and .
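The structure of this objective can be sketched in NumPy as follows. The sketch assumes the usual soft-weighted form from Wang et al. (2019), with one log-sum-exp term over the positives and one over the negatives of each anchor, and it leaves out the hard-sample mining step. The function name and the values of `alpha`, `beta` and `lam` are placeholders chosen for illustration, not the defaults referred to above.

```python
import numpy as np

def multisimilarity_loss(embeddings, labels, alpha=2.0, beta=40.0, lam=0.5):
    """Multisimilarity-style loss on cosine similarities (no mining step)."""
    emb = np.asarray(embeddings, dtype=float)
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)  # unit-normalize
    labels = np.asarray(labels)
    sims = emb @ emb.T  # cosine similarity matrix
    losses = []
    for i in range(len(labels)):
        same = labels == labels[i]
        pos = same.copy()
        pos[i] = False          # an anchor is not its own positive
        neg = ~same
        # Positives are pulled above the offset lam, negatives pushed below it.
        pos_term = np.log1p(np.sum(np.exp(-alpha * (sims[i, pos] - lam)))) / alpha
        neg_term = np.log1p(np.sum(np.exp(beta * (sims[i, neg] - lam)))) / beta
        losses.append(pos_term + neg_term)
    return float(np.mean(losses))
```

An anchor without positives (or without negatives) in the batch contributes zero for the corresponding term, since the sum over an empty set is zero.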
The evaluation metrics used throughout this work are Recall@1 (R@1), Recall@2 (R@2) and Normalized Mutual Information (NMI), which capture two distinct embedding space properties: retrieval quality and clustering quality.
Recall@K (see e.g. Jegou et al., 2011), especially Recall@1 and Recall@2, is the primary metric used to compare DML methods, as it offers strong insight into the retrieval performance of the learned embedding spaces. Given the set of embedded samples with and , and the sorted set of nearest neighbours for any sample , Recall@K is measured as
which evaluates how likely semantically corresponding pairs (as determined here by the class labels) are to occur in a neighbourhood of size K.
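A minimal NumPy sketch of this metric is given below. It assumes Euclidean distances, excludes each query from its own neighbour list, and counts a query as a hit if at least one of its K nearest neighbours shares its label; the function name `recall_at_k` is illustrative.

```python
import numpy as np

def recall_at_k(embeddings, labels, k):
    """Fraction of samples with a same-class sample among their k nearest neighbours."""
    emb = np.asarray(embeddings, dtype=float)
    labels = np.asarray(labels)
    # Pairwise squared Euclidean distances.
    diff = emb[:, None, :] - emb[None, :, :]
    dists = np.sum(diff ** 2, axis=-1)
    np.fill_diagonal(dists, np.inf)            # a query never retrieves itself
    nearest = np.argsort(dists, axis=1)[:, :k]  # indices of the k nearest neighbours
    hits = (labels[nearest] == labels[:, None]).any(axis=1)
    return float(hits.mean())
```

The dense distance matrix is only practical for small test sets; for benchmark-sized data a nearest-neighbour library would typically replace it.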
Normalized Mutual Information (NMI) (see Manning et al., 2010) evaluates the clustering quality of the embedded samples (taken from ). It is computed by first clustering with cluster centers, usually corresponding to the number of classes available, using a clustering method of choice such as K-Means (Lloyd, 1982). This assigns each sample a cluster label/id based on the nearest cluster centroid. With the set of samples with cluster label, the set of cluster sets, the set of samples with true label and the set of class label sets, the Normalized Mutual Information is given as
with mutual information and entropy .
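The computation can be sketched with the standard library alone, once cluster assignments are available. The sketch assumes the normalization by the arithmetic mean of the two entropies, as in Manning et al. (2010), and takes the cluster ids as given, so the K-Means step is not shown. Function names are illustrative.

```python
import math
from collections import Counter

def entropy(assignments):
    """Shannon entropy (natural log) of a list of discrete assignments."""
    n = len(assignments)
    return -sum((c / n) * math.log(c / n) for c in Counter(assignments).values())

def mutual_information(cluster_ids, class_ids):
    """Mutual information between cluster assignments and class labels."""
    n = len(cluster_ids)
    joint = Counter(zip(cluster_ids, class_ids))
    clusters = Counter(cluster_ids)
    classes = Counter(class_ids)
    return sum(
        (n_wc / n) * math.log(n * n_wc / (clusters[w] * classes[c]))
        for (w, c), n_wc in joint.items()
    )

def nmi(cluster_ids, class_ids):
    """Mutual information normalized by the mean of both entropies."""
    denom = (entropy(cluster_ids) + entropy(class_ids)) / 2
    # Convention: if both partitions are trivial (zero entropy), return 1.0.
    return mutual_information(cluster_ids, class_ids) / denom if denom > 0 else 1.0
```

For example, `nmi([0, 0, 1, 1], [1, 1, 0, 0])` evaluates to 1.0 up to floating-point error, because the score is invariant to how cluster ids are named, while `nmi([0, 1, 0, 1], [0, 0, 1, 1])` is 0.0.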
Embedding Space Density. Given sets of embeddings , we first define the average inter-class distance as
which measures the average distances between groups of embeddings with respective classes and , estimated by the respective class centers. denotes a normalization constant based on the number of available classes. We also introduce the average intra-class distance as the mean distance between samples within their respective class
again with normalization constant and set of embeddings with class , . Given these two quantities, the embedding space density is then defined as
and effectively measures how densely samples and classes are grouped together. Roth et al. (2020) show that optimizing the DML problem while keeping the embedding space density high, i.e. without aggressive clustering, encourages better generalization to unseen test classes.
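The following NumPy sketch illustrates the two quantities and their ratio. It assumes that the density is the ratio of average intra-class distance to average inter-class distance, that inter-class distances are taken between class means, and that the normalization constants are plain averages over class pairs and over classes; these choices, and the function name, are assumptions made for illustration.

```python
import numpy as np
from itertools import combinations

def embedding_density(embeddings, labels):
    """Ratio of mean intra-class distance to mean inter-class (center) distance."""
    emb = np.asarray(embeddings, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)

    # Inter-class: mean distance between all pairs of class centers.
    centers = [emb[labels == c].mean(axis=0) for c in classes]
    inter = np.mean([np.linalg.norm(a - b) for a, b in combinations(centers, 2)])

    # Intra-class: mean pairwise distance within each class, averaged over classes.
    intra_per_class = []
    for c in classes:
        members = emb[labels == c]
        pair_dists = [np.linalg.norm(a - b) for a, b in combinations(members, 2)]
        if pair_dists:  # classes with a single sample have no pairs
            intra_per_class.append(np.mean(pair_dists))
    intra = np.mean(intra_per_class)

    return float(intra / inter)
```

The sketch requires at least two classes and at least one class with two or more samples. Larger values correspond to classes that are spread out relative to the distances between class centers.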
Spectral Decay. The spectral decay metric defines the KL-divergence between the (sorted) spectrum of singular values (obtained via Singular Value Decomposition (SVD)) and a -dimensional uniform distribution, and is inversely related to the entropy of the embedding space:
It does not account for class distributions. Roth et al. (2020) show that performing DML while encouraging a high-entropy feature space notably benefits generalization performance. In our experiments, we disregard the first 10 singular vectors (out of 128) to highlight the feature diversity. This is important because we evaluate the spectral decay within the same objectives, which results in the first few singular values being highly similar.
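A small NumPy sketch of this metric follows. It assumes the divergence is taken from the uniform distribution to the normalized singular value spectrum, that the embedding matrix is not centered beforehand, and that discarding leading directions amounts to dropping the largest singular values before renormalizing. The function name, the `skip` parameter and the `eps` stabilizer are illustrative.

```python
import numpy as np

def spectral_decay(embeddings, skip=0, eps=1e-12):
    """KL divergence between a uniform distribution and the normalized spectrum."""
    emb = np.asarray(embeddings, dtype=float)
    # Singular values are returned sorted in descending order.
    singular_values = np.linalg.svd(emb, compute_uv=False)
    singular_values = singular_values[skip:]       # optionally drop leading ones
    spectrum = singular_values / singular_values.sum()
    uniform = np.full_like(spectrum, 1.0 / len(spectrum))
    return float(np.sum(uniform * np.log(uniform / (spectrum + eps))))
```

A perfectly flat spectrum gives a value near zero, while a spectrum dominated by a few directions gives a larger value. For the setting described above, one would call `spectral_decay(embeddings, skip=10)` on 128-dimensional embeddings.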
This part extends the set of ablation experiments performed in section 5.4 of the main paper.
a. Detaching target spaces for distillation. We examine whether it is preferable to detach the target embeddings from the distillation loss (see eq. 3), as we want the reference embedding space to approximate the higher-dimensional relations. Similarly, we do not want the target embedding networks to reduce high-dimensional to lower-dimensional relations to optimize for the distillation constraint. As can be seen in fig. 4C, detaching the target embedding spaces is indeed notably beneficial for a stronger reference embedding, supporting this motivation.
b. Influence of varying target dimensions. As noted at the beginning of section 4, we set the target dimension for dual self-distillation (DSD) to , which we motivate through a small ablation study in fig. 4A, with TD denoting the target dimension of choice. As can be seen, benefits plateau when the target dimension reaches more than four times the reference dimension. However, to be directly comparable to high-dimensional reference settings, we set as default.
c. Ablating multiple distillation scales. Going further, we extend the module with additional embedding branches to the multiscale self-distillation approach (MSD), all operating in different, but higher-than-reference dimensions. As already shown in Figure 3B in the main paper, multiscale distillation is beneficial because it encourages reusable sample relations. In this part, we motivate the choice of four target branches (as noted in sec. 4). Looking at figure 4A, where denotes the number of additional target spaces, we can see a benefit in multiple additional target spaces of ascending dimension. As the improvements saturate after , we simply set this as the default value. However, the additional benefits of going from dual to multiscale distillation are not as high as those of going from no distillation to dual target space distillation, showcasing the general benefit of high-dimensional concurrent self-distillation. Finally, we highlight that a multiscale approach slightly outperforms a multibranch distillation setup (Fig. 4A, Multi-B) where each target branch has the same target dimension of while introducing fewer additional parameters.
d. Finer-grained feature distillation. As already shown in section 4 and again in figure 4B, we see benefits of feature distillation using the (globally averaged) normalized penultimate feature space. It therefore makes sense to investigate the benefits of distilling even more fine-grained feature representations. Defining as the pooling window size applied to the non-averaged penultimate feature representation, we investigate less compressed feature representation spaces. As can be seen in fig. 4B, where denotes the index to , there appears to be no benefit in distilling feature representations higher up the network.
e. Runtime comparison of base dimensionalities. We highlight relative retrieval times at different base dimensionalities in Tab. 3 using faiss (Johnson et al., 2017) on an NVIDIA 1080Ti and a synthetic set of embeddings of dimensionality . With S2SD matching to base dimensionalities (see §5.3), runtime can be reduced by up to an order of magnitude.
To facilitate reproducibility, we provide pseudo-code based on PyTorch (Paszke et al., 2017).
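A minimal PyTorch sketch of a similarity-based distillation term with a detached target space is shown here. It assumes that the term compares row-wise softmax distributions over the batch similarity matrices of the reference and target embeddings via a KL divergence with a temperature; the direction of the divergence, the `batchmean` reduction, the layer sizes and all names are assumptions for illustration and are not taken from the original listing.

```python
import torch
import torch.nn.functional as F

def similarity_distillation(ref_emb, target_emb, temperature=1.0):
    """KL divergence between row-wise softmaxed batch similarity matrices."""
    ref_sims = ref_emb @ ref_emb.T
    # Detach the target similarities so that the distillation term only
    # updates the reference branch (cf. ablation a).
    target_sims = (target_emb @ target_emb.T).detach()
    log_p_ref = F.log_softmax(ref_sims / temperature, dim=1)
    p_target = F.softmax(target_sims / temperature, dim=1)
    return F.kl_div(log_p_ref, p_target, reduction="batchmean")

torch.manual_seed(0)
features = torch.randn(8, 32)            # stand-in for backbone features
ref_head = torch.nn.Linear(32, 16)       # low-dimensional reference branch
target_head = torch.nn.Linear(32, 64)    # higher-dimensional target branch

ref = F.normalize(ref_head(features), dim=1)
target = F.normalize(target_head(features), dim=1)

loss = similarity_distillation(ref, target, temperature=1.0)
loss.backward()  # gradients reach ref_head only; target_head is detached
```

In a full training step this term would be added to the DML objectives applied to each branch, so the target branch is still trained, only not through the distillation loss. Multiple target heads of ascending dimension would each contribute one such term.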
Table: Distillation setups compared. Rows: Best Teacher (d=1024), Base Student (d=128), Distill Student (d=128), Concur. Student (d=128), Joint Student (d=128).
Table: Embedding dimensionality comparison. Rows: Basic (d=16, 32, 64, 128, 256, 512, 1024, 2048); DSD (d=16, 32, 64, 128); MSD (d=256, 512, 1024, 2048); MSDA (d=16, 32, 64, 128, 256, 512, 1024, 2048).