Spherical Feature Transform for Deep Metric Learning

08/04/2020 ∙ by Yuke Zhu, et al. ∙ Megvii Technology Limited

Data augmentation in feature space is an effective way to increase data diversity. Previous methods assume that different classes have the same covariance in their feature distributions. Thus, feature transform between different classes is performed via translation. However, this approach is no longer valid for recent deep metric learning scenarios, where feature normalization is widely adopted and all features lie on a hypersphere. This work proposes a novel spherical feature transform approach. It relaxes the assumption of identical covariance between classes to an assumption of similar covariances of different classes on a hypersphere. Consequently, the feature transform is performed by a rotation that respects the spherical data distributions. We provide a simple and effective training method, and an in-depth analysis of the relation between the two transforms. Comprehensive experiments on various deep metric learning benchmarks and different baselines verify that our method achieves consistent performance improvement and state-of-the-art results.




1 Introduction

It is crucial to have sufficient data diversity in deep metric learning. A common practice is to augment data in the image space. This helps, but its effect is limited. Specifically, it is hard to generate variations within one class using the information in other classes.

Directly augmenting data in the feature space has become a new trend [duan2018deepAdversarial, zhao2018adversarialApproach, radford2015unsupervised, lin2018deepVariational, featureTransferLearning, liu2018featureSpaceTransfer, zheng2019hardnessHDML]. Specifically, Yin et al. [featureTransferLearning] propose a simple method that requires no extra labeling and is easy to implement. It assumes that the example features in each class follow a Gaussian distribution, and that the covariance is the same for all classes and thus shared. Each feature is the sum of a class-dependent mean and a class-independent variance part. Thus, given existing features in one class, their variance parts can be transferred to generate new features in other classes via a translation. This is illustrated in Fig. 1(a). It is shown to be effective in [featureTransferLearning].

Recently, feature normalization has been widely adopted in deep metric learning [ranjan2017l2L2face, wang2017normface, wang2019ranked, wang2019multi, wang2018cosface, deng2019arcface]. In this case, all features lie on the surface of a hypersphere, and the feature transfer approach [featureTransferLearning] becomes inappropriate. First, a Gaussian distribution is no longer correct. A proper spherical distribution should be used instead. Second, although each class can be approximated as a local Gaussian on the sphere, the assumption of identical covariance between classes is less valid. Last, feature translation produces an invalid feature that lies off the surface of the hypersphere, as shown in Fig. 1(b). Therefore, both the prior and the feature transform should be adapted for the spherical case.

(a) (b)
Figure 1: Illustration of two feature transforms. (a) Translation transform [featureTransferLearning]. The features of two classes are sampled from Gaussian distributions with different mean values and identical covariance. To increase the intra-class variance of one class, a new feature is generated by translating a feature of the other class by the difference between the two class means. (b) Translation transform and SFT on a sphere. Directly translating a feature from one class to the other results in a point off the surface of the sphere. Our spherical feature transform performs a rotation instead, so that a feature of one class is transferred to the other class while staying on the sphere.

This work proposes the spherical feature transform to resolve the above problems. It assumes that the feature distributions of different classes are spherical-homoscedastic [hamsici2007spherical]. This relaxes the previous assumption of identical covariance between classes. Instead, it assumes that all classes have similar covariances, where similarity is measured by the equivalence of the eigenvalues of the covariance matrices. Consequently, the transformation between two classes is a rotation that is characterized by the classes' means. This is illustrated in Fig. 1(b). Theoretical analysis reveals that our approach is a generalization of [featureTransferLearning].

Our method is simple and general. It is validated on several deep metric learning tasks. Comprehensive experiments and ablation studies demonstrate its effectiveness.

2 Related Work

Feature augmentation is a relatively new topic. Some researchers [duan2018deepAdversarial, zhao2018adversarialApproach, sohn2017unsupervised, zheng2019hardnessHDML] adopt an adversarial approach to generate hard features from the observed negative samples using Generative Adversarial Networks (GAN) [goodfellow2014generativeGAN]. Their main focus is to generate hard negative features, while the structure of the feature distributions is not considered. Also, the training process with GAN is usually complicated and unstable [brock2018largeGANTraining]. Dixit et al. [dixit2017aga] propose a data augmentation method that uses an attribute-guided feature descriptor for generation. Liu et al. [liu2018featureSpaceTransfer] propose to learn a pose manifold in the feature space and use it to synthesize pose-augmented features. However, these works need extra labeling for supervision.

Recently, Lin et al. [lin2018deepVariational] use variational inference to disentangle intra-class variance and leverage the learned distribution to generate discriminative samples that improve robustness. Their work and ours share the insight that the variances of different classes can be regarded as similar. However, their method is based on the assumption that the variances can be fully disentangled and modeled using a Gaussian, while our method makes no such assumption. In fact, we will show that when features are on a hypersphere, the intra-class variances cannot be modeled using one distribution. The work most similar to ours is [featureTransferLearning]. It also models the variances using a Gaussian and proposes to transfer the variance part from one class to another for feature augmentation. It is introduced in detail in Sec 3.1. However, neither work considers the widely adopted feature normalization techniques and their influence on feature distributions.

3 Proposed Approach

3.1 Review of Feature Transform

Feature transform is an approach for feature generation that transfers the intra-class variance from one class to the others. It is based on the assumption that features from each class follow a Gaussian distribution and that the distributions of different classes have different mean values but shared covariances. Using this assumption, a feature $x$ is represented by two parts:

$$x = \mu + \epsilon \tag{1}$$

where $\mu$ is the mean value of the class that $x$ belongs to and $\epsilon$ is the variance part sampled from a zero-mean Gaussian. $\mu$ contains the identity information of the class. $\epsilon$ contains the information of intra-class variance that is shared among classes.

Following this prior, Feature Transfer Learning (FTL) [featureTransferLearning] is proposed to transfer the variance part from one class to the others for feature generation. Specifically, suppose we are given a feature $x = \mu_1 + \epsilon$ with class mean $\mu_1$ and the center $\mu_2$ of a target class. The new feature is generated as $\tilde{x} = \mu_2 + \epsilon$, where $\tilde{x}$ is regarded as belonging to the target class but shares an identical variance part with $x$. We illustrate this process in Fig. 1(a). The feature transform can also be written as

$$\tilde{x} = x + (\mu_2 - \mu_1) \tag{2}$$

It can be interpreted as translating the feature $x$ by $\mu_2 - \mu_1$. Thus, this method is referred to as the translation transform.
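As a minimal sketch of the translation transform, the snippet below moves a feature from a source class to a target class by adding the difference of the two class means. The function name `translation_transform`, the argument names, and the toy 2-D values are illustrative assumptions and do not come from [featureTransferLearning].

```python
from typing import List

Vector = List[float]


def translation_transform(x: Vector, mu_src: Vector, mu_tgt: Vector) -> Vector:
    """Translate feature x by (mu_tgt - mu_src).

    The variance part (x - mu_src) is kept unchanged and re-attached
    to the mean of the target class.
    """
    return [xi + (mt - ms) for xi, ms, mt in zip(x, mu_src, mu_tgt)]


# Toy example with made-up values.
x = [1.5, 2.0]
mu_src = [1.0, 2.0]
mu_tgt = [4.0, -1.0]
x_new = translation_transform(x, mu_src, mu_tgt)
```

In this toy case the offset of `x_new` from `mu_tgt` equals the offset of `x` from `mu_src`, which is exactly the shared-variance assumption.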

3.2 Review of Spherical-homoscedasticity

Spherical-homoscedasticity is a property describing the relationship between a set of data distributions on the sphere, which we refer to as spherical distributions. It was proposed by Hamsici et al. [hamsici2007spherical].

The definition of spherical-homoscedasticity resorts to the Gaussian approximation. We first give the definition of Gaussian approximation and then give the definition of spherical-homoscedasticity.

Definition 1. Suppose $x$ is a sample from a spherical distribution. Then its Gaussian approximation is given as $N(E(x), V(x))$, where $E(\cdot)$ and $V(\cdot)$ are the functions for expectation and variance.

Definition 2. Suppose distribution $N_1(\mu_1, \Sigma_1)$ is the Gaussian approximation of spherical distribution $S_1$ and $A$ is an orthogonal matrix. Suppose $A$ is spanned by $\mu_1$ and one of the eigenvectors of $\Sigma_1$. Suppose $N_2(A\mu_1, A\Sigma_1 A^\top)$ is the Gaussian approximation of spherical distribution $S_2$. Then $N_1$ and $N_2$ ($S_1$ and $S_2$) are spherical-homoscedastic.

Spherical-homoscedasticity requires the covariances of distributions to have identical eigenvalues. Geometrically, this property indicates that distributions share identical shape. In other words, distributions can be transformed to be totally overlapped.

3.3 Spherical Feature Transform

Recently, feature normalization has been widely discussed [ranjan2017l2L2face, wang2017normface, wang2018cosface] and adopted in DML frameworks [wang2019ranked, wang2019multi, wang2018cosface, deng2019arcface]. This technique scales all features to the same norm. Thus, the features are restricted to lie on the surface of a hypersphere. In this case, the feature transform in Eq. 2 is no longer valid, for two reasons. First, the identical-variance prior is too restrictive for spherical distributions. In general (e.g. the two distributions in Fig. 1(b)), spherical distributions are unlikely to have the same covariance. Second, the translation transform produces features that lie off the surface of the hypersphere. This breaks the manifold structure of the feature space, as shown in Fig. 1(b). Therefore, both the identical-variance prior and the translation transform should be modified for the spherical case.

We propose a new approach. It relaxes the identical-variance prior to a prior of identical eigenvalues of the covariances, which is the spherical-homoscedasticity defined in Sec. 3.2. This relaxation is validated in Fig. 2. The experiment is performed on the CUB dataset (see the experiments for details). We choose four classes with a sufficient number of samples so that their feature distributions can be faithfully estimated. We compare their covariance matrices and eigenvalues. As shown in Fig. 2(b), their covariance matrices are significantly different, but the differences between the eigenvalues of these covariance matrices are much smaller (about 8% on average), as shown in Fig. 2(c). This shows that the identical-variance prior does not hold, and that our assumption of identical eigenvalues of covariances is more valid. Similar observations are made on other datasets in face recognition, vehicle recognition, etc.

(a) (b) (c)
Figure 2: (a) Visualization of features on the CUB dataset. Features are projected to 3D using PCA. (b) Diagonal elements of the covariance matrices of four classes from CUB. The values at the same position on the diagonal are plotted together. They differ considerably. (c) Eigenvalues of the covariance matrices of four classes from CUB. The eigenvalues at the same position in the eigenvalue matrices are plotted together. They are much closer.

Geometrically, our assumption implies that a distribution can be transformed to overlap with another via an orthogonal rotation matrix $A$, as in Definition 2. Thus, a feature vector in one class can be transformed to another class to generate augmented features. We denote the Gaussian approximations of the two class distributions as $N_1(\mu_1, \Sigma_1)$ and $N_2(\mu_2, \Sigma_2)$. Given a feature $x$ sampled from $N_1$, we have:

$$\tilde{x} = A x \tag{3}$$

where $\tilde{x}$ is considered as belonging to the class of $N_2$. This method is called Spherical Feature Transform (SFT).

However, we note that solving for the orthogonal matrix $A$ according to Definition 2 is non-trivial. A brute-force approach would be complex. We propose a simpler and more elegant approach that calculates $A$ without solving matrix equations. It is presented in Proposition 1.
Proposition 1. Suppose $N_1(\mu_1, \Sigma_1)$ and $N_2(\mu_2, \Sigma_2)$ are two Gaussian approximations of spherical distributions. If they are spherical-homoscedastic, then the rotation matrix $A$ between them is spanned by $\mu_1$ and $\mu_2$.

The proof of Proposition 1 is left to the supplement. The rotation matrix is calculated as follows. First, we apply Schmidt orthogonalization to obtain $n_1$ and $n_2$: $n_1 = \mu_1 / \|\mu_1\|$, $n_2 = (\mu_2 - (n_1^\top \mu_2)\, n_1) / \|\mu_2 - (n_1^\top \mu_2)\, n_1\|$. Then, we use the Rodrigues rotation formula to calculate the rotation matrix:

$$A = I + \sin\alpha \, (n_2 n_1^\top - n_1 n_2^\top) + (\cos\alpha - 1)(n_1 n_1^\top + n_2 n_2^\top) \tag{4}$$

where $I$ is the identity matrix and $\alpha$ is the rotation angle between $\mu_1$ and $\mu_2$.
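The rotation step can be sketched in a few lines of plain Python. This is an illustrative sketch, not the authors' implementation. It assumes the standard plane-rotation form of the Rodrigues formula, $A = I + \sin\alpha\,(n_2 n_1^\top - n_1 n_2^\top) + (\cos\alpha - 1)(n_1 n_1^\top + n_2 n_2^\top)$, with $n_1, n_2$ obtained from the two class means by Gram-Schmidt orthogonalization. The names `spherical_feature_transform`, `mu_src` and `mu_tgt` are hypothetical. The matrix is never formed explicitly; it is applied to the feature through dot products.

```python
import math
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(ai * bi for ai, bi in zip(a, b))


def normalize(a: Vector) -> Vector:
    n = math.sqrt(dot(a, a))
    return [ai / n for ai in a]


def spherical_feature_transform(x: Vector, mu_src: Vector, mu_tgt: Vector) -> Vector:
    """Rotate x in the plane spanned by the two class means.

    The rotation maps the direction of mu_src onto the direction of mu_tgt
    and leaves every component orthogonal to that plane unchanged.
    Assumes mu_src and mu_tgt are not parallel (n2 would be undefined).
    """
    # Gram-Schmidt: n1 along mu_src, n2 along the part of mu_tgt orthogonal to n1.
    n1 = normalize(mu_src)
    proj = dot(n1, mu_tgt)
    n2 = normalize([mt - proj * a for mt, a in zip(mu_tgt, n1)])
    # Rotation angle between the two means, taken in [0, pi].
    cos_a = dot(mu_src, mu_tgt) / math.sqrt(dot(mu_src, mu_src) * dot(mu_tgt, mu_tgt))
    cos_a = max(-1.0, min(1.0, cos_a))
    sin_a = math.sqrt(1.0 - cos_a * cos_a)
    # Apply the rotation to x using only the coordinates of x along n1 and n2.
    c1, c2 = dot(n1, x), dot(n2, x)
    return [
        xi + sin_a * (b * c1 - a * c2) + (cos_a - 1.0) * (a * c1 + b * c2)
        for xi, a, b in zip(x, n1, n2)
    ]


# Toy example: two orthogonal class means on the unit sphere in 3-D.
mu_src = [1.0, 0.0, 0.0]
mu_tgt = [0.0, 1.0, 0.0]
x = normalize([1.0, 0.1, 0.2])
x_new = spherical_feature_transform(x, mu_src, mu_tgt)
```

Because the transform is a rotation, `x_new` keeps the norm of `x`, so an L2-normalized feature stays on the hypersphere, in contrast to a translation.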

3.4 Theoretical Analysis

We discuss the relation between the proposed SFT in Eq. (3) and the translation transform in Eq. (2). In general, the two transforms are different. However, we show that, in some special cases, a simple variant of the translation transform (for normalized features) is a degenerated form of SFT. We use this variant as a baseline method in our experiments.

In the translation transform, the variance part $\epsilon$ defined in Eq. (1) is assumed to have the same distribution among all classes. In contrast, SFT is based on the observation that this term should be orthogonally transformed when features are normalized. We observe that, when the network is well trained, the variance parts $\epsilon$ are likely to lie in the invariant subspace of the orthogonal matrix $A$ defined in Definition 2, i.e. $A\epsilon = \epsilon$. This observation is validated experimentally. We show that in this case SFT degenerates to the translation transform defined in Eq. (2). With the condition $A\epsilon = \epsilon$, and since $A$ rotates $\mu_1$ onto $\mu_2$, Eq. (3) simplifies to

$$\tilde{x} = A(\mu_1 + \epsilon) = \mu_2 + \epsilon = x + (\mu_2 - \mu_1) \tag{5}$$

The right side is the translation transform in Eq. (2).

This degenerate case is somewhat hard to picture, especially in high-dimensional space. For an intuitive illustration, we show an example in three-dimensional space. As shown in Fig. 3(a), in general, the result of SFT, $Ax$, is not equal to the translated feature $x + (\mu_2 - \mu_1)$. However, some special features are the same after rotation and after translation, as shown in Fig. 3(b). In such a case, the direction of $\epsilon$ is parallel to the rotation axis of $A$. That is, $\epsilon$ lies in the invariant subspace of $A$.

(a) (b)
Figure 3: Illustration of the degeneration from SFT to the translation transform, using a three-dimensional example with three coordinate axes. The red and blue ellipses represent the distributions of the two classes. Suppose they are spherical-homoscedastic and the rotation matrix between them is $A$. (a) In general, the rotated feature is not equivalent to the translated feature. (b) Special case: the intra-class variances are encoded by the one-dimensional space spanned by the rotation axis, and the rotated feature is equivalent to the translated feature.

Proposition 2. The degeneration happens only before feature normalization. Proof. Suppose feature $x = \mu + \epsilon$ and its variance part $\epsilon$ lies in the invariant subspace of $A$. As $A$ is spanned by $\mu$ and one other vector, $\mu$ is orthogonal to the invariant subspace of $A$. So $\mu$ is orthogonal to $\epsilon$. Then the norm of $x$ is evaluated as:

$$\|x\|^2 = \|\mu\|^2 + \|\epsilon\|^2 \tag{6}$$

As $\|\mu\|$ is a constant for one class and $\epsilon$ varies, the norm of $x$ cannot be constant over the features of a class. In other words, the feature norms are not constant.

Based on Proposition 2, we can make a simple modification to the translation transform so that it produces valid features in the spherical case. Specifically, we apply Eq. 2 before feature normalization and then reproject the results back onto the hypersphere. This variant is referred to as the degenerated form of SFT.
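A minimal sketch of the degenerated form is given below, assuming unit-norm reprojection and class centers estimated on the unnormalized features. The names `degenerated_sft`, `mu_src_raw` and `mu_tgt_raw` and the toy values are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def l2_normalize(v: Vector) -> Vector:
    norm = math.sqrt(sum(vi * vi for vi in v))
    return [vi / norm for vi in v]


def degenerated_sft(x_raw: Vector, mu_src_raw: Vector, mu_tgt_raw: Vector) -> Vector:
    """Translate the unnormalized feature, then project it back to the unit sphere."""
    translated = [xi + (mt - ms) for xi, ms, mt in zip(x_raw, mu_src_raw, mu_tgt_raw)]
    return l2_normalize(translated)


# Toy example with made-up unnormalized features.
x_raw = [3.0, 0.5]
mu_src_raw = [3.0, 0.0]
mu_tgt_raw = [0.0, 4.0]
x_new = degenerated_sft(x_raw, mu_src_raw, mu_tgt_raw)
```

Here the translated point is `[0.0, 4.5]`, which is off the unit sphere until the final normalization brings it back.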

However, the degenerated form produces the same augmented features as SFT only when degeneration takes place. There are still features that do not satisfy the condition of degeneration. Directly applying the degenerated form to them may harm the augmentation process. We further investigate whether there is an ideal case in which degeneration always takes place, so that the degenerated form can be treated as an alternative to SFT. Considering the condition of degeneration, this special case should satisfy $A\epsilon = \epsilon$ for any rotation matrix $A$ and any variance part $\epsilon$. The three-dimensional example in Fig. 3(b) gives a clear clue that this special case exists mathematically. Specifically, if the feature distributions are shrunk in the rotation plane of $A$, then all $\epsilon$ will lie in the invariant subspace of $A$. The exact mathematical description of this case is presented in Proposition 3.
Proposition 3. SFT degenerates to the translation transform iff for feature with ,

The proof is presented in the supplement. Proposition 3 reveals an extremely restrictive condition: the mean vectors and the variance parts lie in two orthogonal subspaces. Intuitively, this condition is hard to satisfy. Surprisingly, although it is not clear why, we find that deep neural networks in general tend to learn orthogonal subspaces for the mean vectors and the variance parts. To reveal this phenomenon, we define a measure of how orthogonal the two subspaces are. We first define two covariance matrices:


where is the label of embedding . is the number of classes. is the number of samples for -th class. Then, we estimate the eigenvalue space for and denote them as , where corresponds to the largest eigenvalues. The subspace spanned by will cover most energy of the mean vectors while the energy of will distribute over these components. We calculate the remaining energy percent of in this subspace by evaluating:


where is the sum of the diagonal elements. measures that how much energy percent of is distributed over the subspace spanned by . is between 0 and 1. If , then the subspaces for and are orthogonal. So the smaller , the nearer of the state being orthogonal.

3.5 Training Scheme

Both the translation transform defined in Eq. 2 and SFT defined in Eq. 3 rely on the accurate estimation of the feature center of each class. We denote the feature centers as where is the number of classes. In every mini-batch, we update them by:


where is the label of feature and is the mini-batch size. is the indicator function. For training, we propose two training schemes, depending on whether the training set is balanced.
Balanced train. When the dataset is balanced in the number of samples per class, we generate new features for every class. Specifically, for each feature, we randomly choose a different class as the target and transform the feature to that class. We do this for every feature in the mini-batch. After that, we obtain a new batch of features with different labels.
Unbalanced train. When the dataset is unbalanced in the number of samples per class, we only generate new features for classes that are short of samples. Specifically, we set a threshold on the number of samples and use it to separate the training data into head classes and tail classes. For each head-class feature in a mini-batch, we randomly choose a tail class as the target and transform the feature into the tail-class distribution.
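The target-class selection in the two schemes can be sketched as follows. This is only an illustration under stated assumptions: class labels are integers in `0..num_classes-1`, a class counts as a tail class when it has fewer samples than the threshold, and features from tail classes are simply left untransformed in the unbalanced scheme. All function names and toy values are hypothetical.

```python
import random
from typing import Dict, List, Optional, Tuple


def split_head_tail(class_counts: Dict[int, int], threshold: int) -> Tuple[List[int], List[int]]:
    """Classes with fewer than `threshold` samples are treated as tail classes."""
    head = sorted(c for c, n in class_counts.items() if n >= threshold)
    tail = sorted(c for c, n in class_counts.items() if n < threshold)
    return head, tail


def balanced_targets(labels: List[int], num_classes: int, rng: random.Random) -> List[int]:
    """For every feature, draw a target class different from its own class."""
    targets = []
    for y in labels:
        t = rng.randrange(num_classes - 1)  # requires num_classes >= 2
        targets.append(t if t < y else t + 1)  # skip the feature's own class
    return targets


def unbalanced_targets(
    labels: List[int], head: List[int], tail: List[int], rng: random.Random
) -> List[Optional[int]]:
    """Head-class features get a random tail class; other features get None."""
    head_set = set(head)
    return [rng.choice(tail) if y in head_set else None for y in labels]


rng = random.Random(0)
batch_labels = [0, 1, 2, 3]
bal = balanced_targets(batch_labels, num_classes=4, rng=rng)
head, tail = split_head_tail({0: 50, 1: 3, 2: 40, 3: 7}, threshold=15)
unbal = unbalanced_targets(batch_labels, head, tail, rng)
```

Each returned target would then be paired with the corresponding class center to transform the feature, using either the translation or the rotation.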

For both training schemes, we get two batches of training data. Let be the original features and be the corresponding labels, where . Let be the generated batch and be the corresponding labels. As our augmentation method is applicable to any DML framework, we denote as a general target function with denoting the parameters to be optimized and , denoting the batch data and labels. Similar to DVML [lin2018deepVariational] and HDML [zheng2019hardnessHDML], we also apply the metric learning losses to the original features in addition to the augmented features. This is because the augmentation process relies on a well-trained feature space. Omitting the original features or putting too much weight on the augmented features harms the training process, as shown in Sec 4.2. We formulate our losses as:


where is a weighting factor controlling the balance between the original batch data and the generated batch data. The total training scheme for feature transform is illustrated in Algorithm 1.

1:Training image set, network , target function , parameters and number of iterations T.
2:Parameters of network
4:for t = 1, ..., T do
5:     Sample mini-batch of training images.
6:     Extract embeddings using to get with labels .
7:     Produce data using (3) or (2).
8:     Update geometric centers using (9).
9:     Optimize using (10).
10:end for
Algorithm 1 Training with Feature Transform

4 Experiments

Datasets and Metrics.

We conduct experiments on two types of benchmark datasets: metric learning and face recognition. For metric learning, we evaluate our approach on three widely used benchmarks: (1) Cars196 [krause20133dCARS], (2) CUB-200-2011 [wah2011caltechCUB], (3) Stanford Online Products (SOP) [oh2016deepLifted]. To evaluate the performance of each method, we follow [duan2018deepAdversarial] to run the K-means algorithm on the test set and report the normalized mutual information (NMI) and F1 metrics, as well as Recall@K for the retrieval task. For face recognition, we use a cleaned version of MS-Celeb-1M [guo2016ms1m] as our training set, which contains 3M facial images and 80920 classes. We present evaluation results on three face verification benchmarks: LFW [huang2008labeledLFW], YTF [wolf2011faceYTF] and IJB-C [maze2018iarpaIJB-C]. For LFW and YTF, we follow the unrestricted with labeled outside data protocol and report the performance on 6,000 face pairs on LFW and 5,000 video pairs on YTF. For IJB-C, we follow the 1:1 verification protocol to evaluate 19,557 positive matches and 15,638,932 negative matches and report the TARs at various FARs.

Implementation Details.

For the metric learning task, we use GoogleNet [googlenet] (or GoogleNet-V2) pre-trained on ImageNet [imagenet_cvpr09] as the backbone network and add a fully connected layer at the end to output the feature embedding. We use the same data preprocessing and augmentation as in Multi-Similarity Loss [wang2019multi]. We set the embedding size to 512 and perform $\ell_2$-normalization on the feature. We use the SGD optimizer with a weight decay of 1e-4 and train for 30,000 iterations. The base learning rate for the backbone is 1e-2 for Cars196 and SOP and 1e-3 for CUB-200-2011; newly added layers use 10x the base learning rate. The learning rate is decayed by a factor of 0.1 every 10,000 iterations. We set the batch size to 60, made up of 20 classes with 3 images per class. The balanced train scheme is adopted when SFT is used. For face recognition, the CNN architecture used in our work is similar to [liu2017sphereface]. We change the number of residual units to to construct a 34-layer residual network. We preprocess all face images with MTCNN [zhang2016jointMTCNN]. The 5 detected facial points are then used to align the face image. After that, we resize the cropped image to . Each pixel (in [0, 255]) in the RGB images is normalized by subtracting 127.5 and then dividing by 128. We use the SGD optimizer with a weight decay of 5e-4 and train for 120K iterations. The learning rate is set to 0.1 initially and is divided by 10 at the 70K, 90K and 110K iterations. The unbalanced train scheme is adopted when SFT is used, where we treat the classes that have fewer than 15 samples as tail classes.
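The pixel normalization and the step learning-rate schedules described above can be written down as a small sketch. The helper names `normalize_pixel` and `step_lr` are hypothetical, and it is an assumption that each decay takes effect exactly at the stated iteration.

```python
from typing import Sequence


def normalize_pixel(value: float) -> float:
    """Map a pixel value in [0, 255] to roughly [-1, 1]."""
    return (value - 127.5) / 128.0


def step_lr(iteration: int, base_lr: float, milestones: Sequence[int], gamma: float = 0.1) -> float:
    """Multiply base_lr by gamma once for every milestone already reached."""
    passed = sum(1 for m in milestones if iteration >= m)
    return base_lr * gamma ** passed


# Face recognition: 0.1 initially, divided by 10 at 70K, 90K and 110K iterations.
face_milestones = (70_000, 90_000, 110_000)
lr_start = step_lr(0, 0.1, face_milestones)
lr_end = step_lr(119_999, 0.1, face_milestones)

# Metric learning on Cars196: base 1e-2, decayed by 0.1 every 10,000 of 30,000 iterations.
cars_lr_mid = step_lr(15_000, 1e-2, (10_000, 20_000))
```

With these settings the face recognition learning rate ends at about 1e-4, and the Cars196 backbone learning rate is about 1e-3 between iterations 10,000 and 20,000.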

Cars196 CUB-200-2011
R@1 R@2 R@4 NMI F1 R@1 R@2 R@4 NMI F1
Triplet 58.4 70.3 80.2 57.0 27.2 42.8 55.2 55.6 52.4 19.1
Triplet+HDML [zheng2019hardnessHDML] 62.0 73.3 82.9 57.7 27.8 44.3 56.0 68.0 55.5 26.7
Triplet+DVML [lin2018deepVariational] 64.4 73.5 78.6 60.5 28.4 43.3 55.8 68.0 55.0 25.2
Triplet+FTL 60.1 71.5 80.5 57.9 25.0 46.8 59.2 70.2 57.3 24.3
Triplet+SFT-d 60.3 71.7 81.4 57.9 28.1 46.5 59.3 70.0 57.9 28.1
Triplet+SFT 65.1 75.7 84.0 58.1 28.6 48.3 60.0 71.2 58.1 28.6
NPair 72.8 82.3 88.5 61.3 29.4 53.5 64.9 72.3 60.4 27.8
NPair+HDML [zheng2019hardnessHDML] 78.9 87.0 91.0 67.1 37.3 53.9 65.8 76.7 62.0 30.0
NPair+DVML [lin2018deepVariational] 80.2 85.6 91.9 66.1 34.8 54.2 66.2 77.3 62.0 31.5
NPair+FTL 73.1 82.2 88.6 60.0 27.4 54.0 66.0 77.0 61.9 29.7
NPair+SFT-d 76.2 85.0 90.9 64.2 33.1 54.5 67.0 77.7 62.0 30.1
NPair+SFT 79.4 87.1 92.4 67.2 37.3 54.7 67.0 77.5 62.2 30.5
RLL [wang2019ranked] 74.2 83.2 89.0 62.2 32.9 59.6 71.0 80.5 64.3 32.9
RLL+DVML [lin2018deepVariational] 79.0 86.6 91.3 65.5 34.9 60.2 71.7 81.0 64.7 33.0
RLL + SFT-d 78.8 86.7 92.1 65.4 34.4 59.4 71.2 80.9 64.2 32.8
RLL + SFT 80.2 88.1 92.8 66.1 35.3 60.3 71.8 81.1 64.9 33.6
MS [wang2019multi] 84.0 90.2 94.1 72.8 45.3 65.7 76.6 84.6 69.0 39.6
MS+DVML [lin2018deepVariational] 84.4 90.8 92.4 72.0 45.3 66.2 76.7 85.1 69.6 40.0
MS + SFT-d 83.8 90.4 94.6 73.1 45.3 66.1 76.8 85.2 70.0 41.6
MS + SFT 84.5 90.6 94.6 73.2 45.8 66.8 77.5 85.8 70.3 40.4
Table 1: Comparison on Cars196 and CUB-200-2011.

Compared Methods.

We compare our method to other feature generation methods, including HDML [zheng2019hardnessHDML], DVML [lin2018deepVariational] and FTL [featureTransferLearning]. These methods are introduced in Sec 2. They require no extra labeling and can be compared fairly on metric learning tasks. The degenerated SFT is also included for comparison. It is denoted as SFT-d in the results. The comparison is made on two traditional representative baseline losses, namely triplet loss [schroff2015facenet] and NPair loss [sohn2016npair], and on two recent baseline losses that achieve high results, namely Ranked List Loss (RLL) [wang2019ranked] and Multi-Similarity Loss (MS) [wang2019multi]. Most of the comparisons are made on GoogleNet [googlenet] because almost all of the chosen competitors report their results on this backbone. For comparison with the SOTA, we also make some comparisons on GoogleNet-V2. For a fair comparison, we implement all of these methods and report the results from our experiments.

For FTL [featureTransferLearning], the features are normalized in our implementation, as we found that this greatly outperforms the original method. FTL differs from the degenerated form of SFT in that it requires pre-training of the network and is applied in the fine-tuning stage, which is needed by neither SFT nor the degenerated form. Also, FTL requires a decoder network and only transfers part of the energy of the variance part using PCA. In our implementation, we follow them and use 95%.

4.1 Quantitative Results

Table 1, Table 3, Table 2 and Table 4 present the experimental results of SFT on three popular deep metric learning benchmarks and three face recognition benchmarks, respectively.

Compared with the baseline methods, SFT significantly improves their performance, especially on Cars196 and CUB-200-2011. For example, when coupled with NPair loss, SFT improves the baseline R@1 on Cars196 by 6.6 points (from 72.8 to 79.4). SFT also boosts performance on the higher baselines set by the two most recent losses, Multi-Similarity loss and Ranked List loss. SFT is relatively less effective on SOP (Table 3). The reason is that the number of samples for each class in SOP is too small (about 5).

To sum up, SFT performs better than HDML [zheng2019hardnessHDML], FTL [featureTransferLearning] and DVML [lin2018deepVariational] in most settings. For example, when coupled with NPair loss, SFT outperforms HDML by 0.5 in R@1 on Cars196 (79.4 vs. 78.9).

Method Training Data LFW YTF
DeepFace+ [taigman2014deepfaceplus] 4M 97.35 91.4
FaceNet [schroff2015facenet] 200M 99.63 95.1
DeepID2+ [sun2015deeplyDeepID2] 300K 99.47 93.2
SphereFace [liu2017sphereface] 0.5M 99.42 95.0
CosFace [wang2018cosface] 5M 99.73 97.6
ArcFace [deng2019arcface] 5.8M 99.83 98.02
L2-Face [ranjan2017l2L2face] 3.7M 99.78 96.08
L2-Face [ranjan2017l2L2face] (ours) 3M 99.45 96.0
L2-Face(ours) + SFT-d 3M 99.41 95.9
L2-Face(ours) + SFT 3M 99.50 96.5
CosFace [wang2018cosface] (ours) 3M 99.68 96.2
CosFace (ours) + SFT-d 3M 99.70 96.5
CosFace (ours) + SFT 3M 99.73 97.2
Table 2: Face verification (%) on the LFW and YTF datasets.

On higher baselines, such as Multi-Similarity Loss, our SFT outperforms DVML by 0.7 on CUB. The degenerated form of SFT is effective on most baseline methods. On average, however, it is outperformed by SFT. Besides the metric learning losses, SFT can also be used together with softmax-based losses, which are mainly used in face recognition tasks. On the LFW and YTF datasets (shown in Table 2), the performance of deep neural networks is nearly saturated, but we still report the performance for comparison with other works. On IJB-C (shown in Table 4), we provide competitive baselines for both L2-Face [ranjan2017l2L2face] and CosFace [wang2018cosface], and SFT still boosts the performance over these baselines.

R@1 R@10 R@100 NMI F1
Triplet (ours) 70.8 85.5 93.8 88.2 28.0
Triplet + SFT-d 71.9 86.4 94.4 88.5 29.3
Triplet + SFT 72.3 86.5 94.5 88.6 29.9
RLL [wang2019ranked] (ours) 77.5 89.9 95.8 89.7 35.3
RLL + SFT-d 77.9 90.3 96.1 89.8 35.9
RLL + SFT 77.8 90.2 96.0 89.9 36.4
MS [wang2019multi] (ours) 73.1 87.2 94.7 88.5 29.6
MS + SFT-d 73.5 87.5 94.9 88.6 29.8
MS + SFT 73.4 87.1 94.7 88.8 30.9
Table 3: Experimental results on Stanford Online Products (SOP). SFT is less effective on SOP as the number of samples for each class is only 5.

Method Training Data IJB-C(TAR@FAR)
0.001% 0.01% 0.1%
Vggface2 [cao2018vggface2] 3.3M 74.7 84.0 91.0
L2-Face [ranjan2017l2L2face] 3.3M 78.54 87.01 92.10
Arcface [deng2019arcface] 5.8M - 92.10 -
L2-Face [ranjan2017l2L2face](ours) 3M 79.3 87.3 93.3
L2-Face(ours) + SFT-d 3M 79.4 87.9 93.3
L2-Face(ours) + SFT 3M 80.6 88.2 93.6
CosFace [wang2018cosface] (ours) 3M 85.67 92.11 95.4
CosFace (ours) + SFT-d 3M 86.85 92.78 95.72
CosFace (ours) + SFT 3M 87.19 92.63 95.6
Table 4: Comparison with the state-of-the-art methods on IJB-C. The ‘-’ denotes that the corresponding results are not reported in the original paper.

4.2 Ablation Study

In this part, we conduct the ablation study on Cars-196 with the ranked list loss. The conclusions from these experiments are also applicable to other datasets and loss functions.

(a) (b)
Figure 4: Effect of SFT on feature distributions. (a) Left: Divergences of each class in baseline and SFT. The blue dashed line represents the average divergence of the baseline. Right: The standard deviation of divergences in baseline and SFT. (b) Histograms of positive (blue) and negative (orange) distance distributions on the Cars196 test set for (from left to right) the initial state with the pre-trained model, training with ranked list loss, and training with ranked list loss together with SFT.

Effect on Feature Distributions We find that SFT makes feature distributions more similar to each other than the baseline method does. This is consistent with our prior that feature distributions should be similar to each other. In other words, SFT brings the eigenvalues of the covariances of different classes closer together. Specifically, we compare the similarity using the trace of the scattering matrix of each class. We refer to this trace as the divergence. The scattering matrix of each class is defined as:


Fig. 4(a) shows the divergences of each class and the standard deviation of the divergences. The class IDs are sorted according to the divergences of the baseline. For clarity, only one in every four values is shown in the histogram. It is observed that the divergences among classes are more balanced when SFT is applied. In general, the divergences below the average (blue dashed line) are increased and those above the average are decreased. The right part of Fig. 4(a) displays the standard deviations of the divergence values. It is consistent with this conclusion.
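The divergence statistic can be sketched as follows. The sketch assumes that the scattering matrix of a class is the sample-averaged outer product of the centered features, so that its trace equals the mean squared distance of the features to the class center; the exact normalization used in the paper may differ. The function name `class_divergence` and the toy features are hypothetical.

```python
import statistics
from typing import Dict, List

Vector = List[float]


def class_divergence(features: List[Vector]) -> float:
    """Trace of the (assumed sample-averaged) scatter matrix of one class.

    Computed as the mean squared distance to the class center, which avoids
    building the d x d matrix.
    """
    n, d = len(features), len(features[0])
    center = [sum(f[j] for f in features) / n for j in range(d)]
    return sum(sum((f[j] - center[j]) ** 2 for j in range(d)) for f in features) / n


# Toy features for three classes (made-up values).
features_by_class: Dict[int, List[Vector]] = {
    0: [[1.0, 0.0], [0.0, 1.0]],
    1: [[1.0, 0.0], [1.0, 0.0]],
    2: [[0.0, 1.0], [0.0, -1.0]],
}
divergences = {c: class_divergence(f) for c, f in features_by_class.items()}
# A smaller spread means the class divergences are more balanced.
spread = statistics.pstdev(divergences.values())
```

Comparing `spread` between two trained models would correspond to the standard-deviation comparison in the right part of Fig. 4(a).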

Furthermore, the distributions of pair distance are compared. It is shown in Figure 4(b). It is observed that the overlap between the positive parts and negative parts is reduced when SFT is applied. This indicates that SFT helps the network to learn a more discriminative feature space.

Baseline Balanced Train Unbalanced Train
Head 95.61 95.77 95.49
Tail 73.91 78.35 82.32
Table 5: Effect of SFT on an unbalanced dataset. Head represents classes with rich samples. Tail represents classes in short of training samples. The results are the classification accuracy(%).

Effect on Unbalanced Datasets Face recognition datasets differ from DML datasets in that they are usually long-tailed: many classes have only a few samples. These classes are usually called tail classes. Experimentally, we find that SFT improves the performance on tail classes. Specifically, we select all classes in MS-Celeb-1M that contain more than 100 samples to construct a mini-dataset, which gives 2,445 classes in total. We then randomly choose 1,500 classes as head classes and use 50 samples of each for training. The remaining 945 classes are treated as tail classes, with 5 samples of each used for training. All other samples are left for testing. As this training set is much smaller than MS-Celeb-1M, we adopt a smaller network: a similar CNN architecture, except that we change the number of residual units. The results are shown in Table 5. The baseline performs worst on the tail classes, but when SFT is applied, the tail accuracy increases by a large margin. The best tail performance is achieved by the unbalanced train scheme, which outperforms the baseline by 8.4 percentage points (82.32 vs. 73.91), while the drop on the head classes is negligible (95.49 vs. 95.61). In summary, SFT effectively improves the accuracy of tail classes.
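The split described above can be sketched as follows. The function name, the `(sample_id, class_id)` input format and the seed handling are hypothetical; only the thresholds (more than 100 samples per class, 1,500 head classes with 50 training samples, 5 training samples per tail class) come from the text.

```python
import random
from collections import defaultdict

def build_long_tail_split(samples, min_per_class=100, n_head=1500,
                          head_train=50, tail_train=5, seed=0):
    """samples: iterable of (sample_id, class_id) pairs.

    Returns (train, test, head_classes, tail_classes).
    """
    by_class = defaultdict(list)
    for sample_id, class_id in samples:
        by_class[class_id].append(sample_id)

    # Keep only classes with more than `min_per_class` samples.
    kept = sorted(c for c, ids in by_class.items() if len(ids) > min_per_class)
    rng = random.Random(seed)
    rng.shuffle(kept)
    head, tail = set(kept[:n_head]), set(kept[n_head:])

    train, test = [], []
    for c in kept:
        ids = list(by_class[c])
        rng.shuffle(ids)
        k = head_train if c in head else tail_train
        train += [(i, c) for i in ids[:k]]   # few samples for tail classes
        test += [(i, c) for i in ids[k:]]    # everything else is held out
    return train, test, head, tail
```

With the defaults and the 2,445 qualifying classes mentioned in the text, this would yield 1,500 head classes and 945 tail classes.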

       Baseline    SFT     Random    Pick
R@1    74.2        80.2    74.4      74.6
Table 6: Impact of center estimation.

Impact of Center Estimation

As the rotation matrix of SFT is estimated from the feature centers, center estimation is essential. To evaluate its importance, we compare the image retrieval performance under three settings: (1) “Random”: skip the center estimation step in line 6 of Algorithm 1. (2) “Pick”: randomly pick one sample from the same class as the center. (3) “SFT”: the standard SFT procedure. The results are shown in Table 6. With Random or Pick, the performance is almost the same as the baseline (74.4 and 74.6 vs. 74.2 R@1); only the standard SFT procedure yields a noticeable improvement (80.2). This illustrates that accurate estimation of class centers is crucial for the feature transform. Moreover, even when the center estimation is inaccurate, the feature transform does not harm training much. This suggests that training with the feature transform is stable.
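The three settings can be illustrated with a small sketch. Several pieces are assumptions and not the paper's Algorithm 1: the rotation is the standard minimal rotation that maps one unit vector onto another and leaves the orthogonal complement fixed; the SFT center is taken as the normalized class mean; and “Random” is interpreted as using an arbitrary random direction in place of an estimated center. All function names are hypothetical.

```python
import numpy as np

def unit(v):
    return v / np.linalg.norm(v)

def rotation_between(a, b):
    """Rotation taking unit vector a to unit vector b.

    Acts as the identity on the orthogonal complement of span{a, b}.
    Requires a @ b != -1 (a and b must not be antipodal).
    """
    c = float(a @ b)
    k = np.outer(b, a) - np.outer(a, b)          # skew-symmetric generator
    return np.eye(len(a)) + k + (k @ k) / (1.0 + c)

def estimate_center(class_feats, mode, rng):
    if mode == "sft":       # assumption: normalized mean of the class features
        return unit(class_feats.mean(axis=0))
    if mode == "pick":      # one randomly chosen sample of the class
        return unit(class_feats[rng.integers(len(class_feats))])
    if mode == "random":    # assumption: no estimation, a random direction
        return unit(rng.standard_normal(class_feats.shape[1]))
    raise ValueError(f"unknown mode: {mode}")

def transform(x, src_feats, dst_feats, mode, rng):
    """Move feature x from the source class toward the target class."""
    r = rotation_between(estimate_center(src_feats, mode, rng),
                         estimate_center(dst_feats, mode, rng))
    return r @ x
```

Because the transform is a rotation in every mode, the output stays on the unit hypersphere even when the centers are poorly estimated, which is in line with the observation that inaccurate centers do not destabilize training.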

Figure 5: Performance under different settings. Left: batch size. Right: weight factor.

Batch Size. The batch size is usually important in deep metric learning, as it determines the number of positive and negative pairs used to construct the target loss. With our method, the number of positive and negative pairs is enlarged. We therefore run experiments with different batch sizes to “fairly” compare performance under a comparable number of positive and negative pairs. The results are shown in the left part of Fig. 5. SFT evaluated with a small batch size of 30 still beats the baseline with the largest batch size of 240. This suggests that the improvement brought by SFT is not due to the increase in batch size.
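To make the dependence of pair counts on batch size concrete, the following sketch counts positive and negative pairs from a list of batch labels. The batch composition used in the usage note (three samples per class) is a hypothetical choice; the sampling scheme is not stated in this section.

```python
from collections import Counter

def count_pairs(labels):
    """Return (positive_pairs, negative_pairs) among all unordered pairs."""
    n = len(labels)
    total = n * (n - 1) // 2
    positive = sum(k * (k - 1) // 2 for k in Counter(labels).values())
    return positive, total - positive
```

For example, `count_pairs([c for c in range(10) for _ in range(3)])` counts the pairs in a batch of 30 built from 10 classes. Appending labels for generated features to the list shows how augmentation enlarges both counts without changing the number of real samples.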

Effect of the Weight Factor. We conduct experiments to explore the influence of the weight factor. As shown in Fig. 5, as the weight factor increases, the performance first increases and then decreases. When the weight factor is too large, the performance drops significantly. We attribute this drop to the gradients from the generated features dominating the optimization process and interfering with the optimization of the regular features. In practice, the optimal value is data-dependent, and we do not investigate it further. In most of our experiments, the value is set to 0.2.

Discussion of Degeneration. In Sec 3.4, it is hypothesized that defined in Eq. 1 is likely to lie in the invariant subspace of . To investigate whether the hypothesis holds, we evaluate the value of on five versions of ResNet. The distribution of on ResNet50 is shown in Fig. 6(a). A large number of values are near zero. For these features, the augmented features produced by SFT and by the degenerated form are close. For each backbone, is evaluated times and the mean value is reported. The results are shown in Fig. 6(b). As observed, the hypothesis is more likely to hold in deeper networks. This implies that the degeneration of SFT becomes more likely as the network gets deeper.

Our experiment also reveals that the learned subspaces for and tend to be orthogonal. This is presented in Fig. 6(c). For example, on the backbone of ResNet50, only distribute 10% energy on the subspace that covers 99% energy of . This means that, although the ideal condition in Proposition 3 cannot be reached, the learned feature space tends to approach it. These experimental results support our analysis that the degeneration of SFT happens for most features. Since the comparison shows that SFT outperforms the degenerated form in most scenarios, the side effect of the degenerated form on the features that do not degenerate should not be neglected.
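The degeneration condition can be checked numerically with a small sketch. It rests on assumptions that this text does not confirm symbol by symbol: the degenerated form is taken to be translation by the difference of the two class centers, the rotation is the minimal rotation between the two centers, and the relevant quantity is the offset of a feature from its class center. Under those assumptions, rotation and translation coincide exactly when the offset is orthogonal to the plane spanned by the two centers. All names are hypothetical.

```python
import numpy as np

def unit(v):
    return v / np.linalg.norm(v)

def rotation_between(a, b):
    """Minimal rotation taking unit vector a to unit vector b (a @ b != -1)."""
    c = float(a @ b)
    k = np.outer(b, a) - np.outer(a, b)
    return np.eye(len(a)) + k + (k @ k) / (1.0 + c)

def split_offset(a, b, offset):
    """Split an offset into its parts inside and orthogonal to span{a, b}."""
    q, _ = np.linalg.qr(np.stack([a, b], axis=1))   # orthonormal basis of the plane
    inside = q @ (q.T @ offset)
    return inside, offset - inside

def degeneration_gap(a, b, offset):
    """Distance between rotating and translating the feature a + offset."""
    x = a + offset
    rotated = rotation_between(a, b) @ x
    translated = x + (b - a)
    return float(np.linalg.norm(rotated - translated))
```

The gap is zero for the orthogonal part of any offset and grows with the part that lies in the plane of the two centers, which is one way to read why features whose offsets carry little energy in that plane behave the same under both transforms.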

Figure 6: Experiment on the condition of the degeneration. (a) The distribution of sampled from features of backbone ResNet50. (b) The mean values of the distributions of from different backbones. (c) defined in Eq. 8 with respect to the energy of .

5 Conclusion

In this paper, we propose Spherical Feature Transform (SFT) to generate new features from existing ones. SFT effectively enriches the intra-class variance of both regular and under-represented classes. We demonstrate its effectiveness by applying it to several recent DML frameworks on three popular deep metric learning benchmark datasets and three face recognition benchmark datasets.


This work was supported in part by the National Key Research and Development Program of China under Grant 2017YFA0700800.