Project page for Heated-up Softmax Embedding
Metric learning aims at learning a distance which is consistent with the semantic meaning of the samples. The problem is generally solved by learning an embedding for each sample such that the embeddings of samples of the same category are compact while the embeddings of samples of different categories are spread-out in the feature space. We study the features extracted from the second last layer of a deep neural network based classifier trained with the cross entropy loss on top of the softmax layer. We show that training classifiers with different temperature values of the softmax function leads to features with different levels of compactness. Leveraging these insights, we propose a "heating-up" strategy to train a classifier with increasing temperatures, leading the corresponding embeddings to achieve state-of-the-art performance on a variety of metric learning benchmarks.
Metric learning aims at learning a metric space in which the samples from the same classes are close (compact) and the samples from different classes are far away (spread-out in the space) (Hoffer & Ailon, 2015; Schroff et al., 2015; Zhang et al., 2017). Its applications (Ye et al., 2016; Zhang et al., 2017; Schroff et al., 2015) include person re-identification (Koestinger et al., 2012; Lisanti et al., 2017).
One solution for this problem is to define a loss function that enforces the properties of compactness and spread in the metric space. Two of the most popular loss functions are the contrastive loss (Chopra et al., 2005) and the triplet loss (Hoffer & Ailon, 2015). However, both losses face challenges in sampling, as there is usually a very large number of possible pairs or triplets in one dataset.
To overcome the sampling issue, a variety of hard mining strategies (Schroff et al., 2015; Mishchuk et al., 2017; Harwood et al., 2017) have been proposed. However, sampling the “hardest” samples (samples that most violate the predefined rules) often leads to poor local minima, while incorporating too many “easy” samples renders the training inefficient. Thus, designing a structured loss to perform hard mining effectively and efficiently has become a hot research topic (Song et al., 2017, 2016).
Using features from the second last layer (a.k.a. bottleneck layer) of a deep neural network trained as a classifier with the softmax function and the cross-entropy loss works well for many metric learning based applications (Razavian et al., 2014) such as image retrieval (Babenko & Lempitsky, 2015) and face verification (Liu et al., 2017). However, the goals of classifier training and metric learning are different. The former aims at finding the best decision function, while the latter aims to learn an embedding such that embeddings of samples of the same category are compact while those of samples of different categories are “spread-out”. This motivates us to investigate the relation between metric learning and classifier training.
In this paper, we show that the temperature parameter in the softmax function, defined by Hinton et al. (2015) for knowledge transfer, plays an important role in determining the distribution of the embeddings from the bottleneck layer. Based on the observed relation, we propose to learn a classifier with an intermediate temperature at the beginning and increase the temperature during training. Compared to the state-of-the-art methods in deep metric learning, the proposed “heating-up” method achieves significantly better performance for most cases and at least comparable performances for the rest.
The main contributions of this paper are that:
we study the gradient of the softmax layer and show how the temperature parameter affects the final distribution of the embedding from the bottleneck layer;
we propose a “heating-up” method that can be used to obtain an effective embedding with much better or at least comparable performance compared to state-of-the-art methods in deep metric learning.
Siamese networks with contrastive loss (Chopra et al., 2005) were one of the earliest attempts to solve the metric learning problem. By sampling either two data samples from the same category (positive pair) or from two different categories (negative pair), contrastive loss tries to pull the two points of a positive pair together and push the points of a negative pair apart. Triplet loss (Hoffer & Ailon, 2015) further requires a margin between the distance of the positive pair and the distance of the negative pair. One of the main issues of contrastive and triplet losses is that the number of possible pairs or triplets is extremely large for a large dataset.
A reasonable solution to address the sampling issue is mining the samples that are the most informative for training, also known as “hard mining”. There is a large body of work addressing this problem (Schroff et al., 2015; Mishchuk et al., 2017; Harwood et al., 2017; Yuan et al., 2017; Wu et al., 2017). Semi-hard mining (Schroff et al., 2015) tries to find triplets in a training batch for which the distance of the positive pair and the distance of the negative pair are within a certain margin. HardNet (Mishchuk et al., 2017) is designed to mine some of the hardest triplets within one training batch.
Designing structured losses that consider all the possible training pairs or triplets within one training batch and perform “soft” hard mining can be an alternative solution for hard mining (Song et al., 2017, 2016; Ustinova & Lempitsky, 2016). Lifted structured loss (Song et al., 2016) exploits all triplets in a training batch and provides a smooth loss function for hard mining. A few deep clustering based losses (Law et al., 2017; Song et al., 2017) have also been proposed to solve the problem. Proxy NCA (Movshovitz-Attias et al., 2017) proposes to learn semantic proxies for training data and use an NCA loss for training. Applying hard mining with proxies is more efficient than with samples.
In face verification, quite a few works have shown that training a classifier and using the output of the second last layer as the embedding performs reasonably well (Wang et al., 2017b). NormFace (Wang et al., 2017a) and SphereFace (Liu et al., 2017) suggest $\ell_2$-normalizing both the embeddings and the classifier weights. In order to achieve promising results, the final logits usually need to be multiplied by a learnable or fixed scalar (Ranjan et al., 2017). There are some very preliminary discussions (Wang et al., 2017a; Ranjan et al., 2017) about how this scalar influences the final embedding.
This paper shows that the scalar can be seen as the temperature parameter of the softmax function in Hinton et al. (2015). We analyze how this temperature parameter controls the distribution, especially the compactness, of the embedding by assigning different gradients to different samples depending on their positions w.r.t. the boundary of the classifier. Inspired by these findings, we propose a “heating-up” strategy for training the embedding, which uses increasing temperature while training the classifier. The proposed method makes the embedding trained with the softmax function and the cross entropy loss achieve comparable or better performance than state-of-the-art deep metric learning methods.
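Given a set of training samples $X = \{x_1, \dots, x_N\}$ with labels $Y = \{y_1, \dots, y_N\}$, where $x_i \in \mathbb{R}^D$ is a data sample, $D$ being the number of dimensions of the training data, and $y_i \in \{1, \dots, C\}$ is the category label of sample $x_i$, $C$ being the number of categories of the training samples, we try to learn an embedding function $f: \mathbb{R}^D \to \mathbb{R}^d$, which maps a data sample to a vector in $\mathbb{R}^d$, such that for all $i, j, k$ with $y_i = y_j \neq y_k$, $\mathrm{dist}(f(x_i), f(x_j)) < \mathrm{dist}(f(x_i), f(x_k))$, where $\mathrm{dist}(\cdot, \cdot)$ is a distance function.
We call $f(x_i)$ the embedding of the data sample $x_i$ and write $f_i$ for $f(x_i)$ to simplify the notation. Consider training a linear classifier with weights $W = [w_1, \dots, w_C] \in \mathbb{R}^{d \times C}$ and bias $b \in \mathbb{R}^C$; $z_i = W^\top f_i + b$ is called the logits. The probability that sample $x_i$ belongs to category $m$ can be predicted by the softmax function as:
$$p(m \mid x_i) = \frac{\exp(z_{im}/T)}{\sum_{j=1}^{C} \exp(z_{ij}/T)} = \frac{\exp(\alpha z_{im})}{\sum_{j=1}^{C} \exp(\alpha z_{ij})} \tag{1}$$
$T$, which is normally set to 1, is the temperature as mentioned in Hinton et al. (2015). We set $\alpha = 1/T$, the reciprocal of the temperature, to simplify the notation in the paper.
Assume the ground-truth distribution of the training sample $x_i$ is $q(m \mid x_i)$; generally $q$ is a Dirac delta function, which equals 1 if $m = y_i$ and 0 otherwise, where $y_i$ is the ground-truth label of $x_i$. The cross entropy loss with respect to $x_i$ and its gradient with respect to $z_{im}$ are defined as:
$$L_i = -\sum_{m=1}^{C} q(m \mid x_i) \log p(m \mid x_i), \qquad \frac{\partial L_i}{\partial z_{im}} = \alpha \left( p(m \mid x_i) - q(m \mid x_i) \right) \tag{2}$$
Considering $z_{im} = w_m^\top f_i + b_m$, we have:
$$\frac{\partial L_i}{\partial f_i} = \sum_{m=1}^{C} \frac{\partial L_i}{\partial z_{im}} \frac{\partial z_{im}}{\partial f_i} = \alpha \sum_{m=1}^{C} \left( p(m \mid x_i) - q(m \mid x_i) \right) w_m \tag{3}$$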
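The softmax with inverse temperature and the gradient of the cross entropy loss with respect to the embedding can be checked numerically. The following is a minimal sketch in plain Python, not the authors' implementation: the function names, the toy weights `W`, bias `b`, and embedding `f` are all made up for illustration. It compares the closed-form gradient of Eq. (3) against central finite differences.

```python
import math

def softmax_alpha(logits, alpha=1.0):
    """Softmax with inverse temperature alpha = 1/T, as in Eq. (1)."""
    m = max(logits)  # subtract the max logit for numerical stability
    exps = [math.exp(alpha * (z - m)) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def logits_of(f, W, b):
    """z_m = w_m^T f + b_m; W is a list holding one weight vector per class."""
    return [sum(wk * fk for wk, fk in zip(w, f)) + bm for w, bm in zip(W, b)]

def cross_entropy(f, W, b, y, alpha=1.0):
    """L = -log p(y | x) for a one-hot ground-truth distribution."""
    return -math.log(softmax_alpha(logits_of(f, W, b), alpha)[y])

def grad_wrt_embedding(f, W, b, y, alpha=1.0):
    """Closed form of Eq. (3): alpha * sum_m (p_m - q_m) * w_m."""
    p = softmax_alpha(logits_of(f, W, b), alpha)
    grad = [0.0] * len(f)
    for m, w in enumerate(W):
        coeff = alpha * (p[m] - (1.0 if m == y else 0.0))
        for k, wk in enumerate(w):
            grad[k] += coeff * wk
    return grad

def numeric_grad(f, W, b, y, alpha=1.0, eps=1e-6):
    """Central finite differences, used only to check the closed form."""
    out = []
    for k in range(len(f)):
        hi = list(f)
        lo = list(f)
        hi[k] += eps
        lo[k] -= eps
        diff = cross_entropy(hi, W, b, y, alpha) - cross_entropy(lo, W, b, y, alpha)
        out.append(diff / (2 * eps))
    return out

# Toy problem (illustrative values): 3 classes, 2-dimensional embedding.
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b = [0.0, 0.1, -0.2]
f = [0.3, 0.8]
analytic = grad_wrt_embedding(f, W, b, y=0, alpha=4.0)
numeric = numeric_grad(f, W, b, y=0, alpha=4.0)
print(analytic, numeric)
```

The two gradients should agree up to finite-difference error, which also confirms the factor $\alpha$ in front of the sum.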
The proposed “heating-up” idea is based on the observation that different values of $\alpha$ (the reciprocal of the temperature) will assign gradients with different magnitudes to different samples and thus change the distribution of the final embeddings. To show this, since $\alpha$, the norm of the embedding and the norm of the classifier weights all affect the softmax function (Eqs. (1) and (3)), we follow Wang et al. (2017a) and Ranjan et al. (2017) and $\ell_2$-normalize the classifier weights and features to have unit norm. The normalized features and weights are denoted as $\tilde{f}_i$ and $\tilde{w}_m$ respectively.
The remainder of this section first discusses how $\alpha$ changes the gradient assignment (Sec. 3.1) and the final embedding (Sec. 3.2) with $\ell_2$-normalized embeddings and weights, then the effect of normalization (Sec. 3.3). Finally, we discuss the embedding and gradient properties of an off-the-shelf classifier in Sec. 3.4.
In this subsection, we will show how training a deep classification network with different $\alpha$ values affects the gradients of different training samples. From Eq. (1), when $\alpha \to +\infty$, $p(m \mid x_i)$ satisfies:
$$\lim_{\alpha \to +\infty} p(m \mid x_i) = \begin{cases} 1/k, & \text{if } z_{im} = \max_j z_{ij} \\ 0, & \text{otherwise} \end{cases} \tag{4}$$
where $k$ is the number of logits whose value equals the maximum logit value. On the other hand, if $\alpha$ approaches 0, the predicted probability will approach the uniform distribution. In other words, as $\alpha$ increases, the predicted probability will become more “spiky” at the logits that have the largest value.
We define 2 types of training samples as in Fig. 1. In the figure, all data samples (crosses, triangles and circles) belong to the same category. All the samples in the area marked by blue dashed lines will be classified as the correct category by the classifier. The training samples can be divided into: (i) “Hard” samples (crosses): samples that are not classified as the correct category ($\arg\max_m z_{im} \neq y_i$); (ii) “Easy” samples: samples that are correctly classified by the classifier ($\arg\max_m z_{im} = y_i$). There are two subtypes of samples in the “Easy” category: “Boundary” samples (triangles) are samples close to the decision boundary; “Centroid” samples (circles) are samples lying close to the center of the region that belongs to the category.
The gradient of the loss w.r.t. the normalized embedding (i.e. Eq. (3) with $\tilde{f}_i$ and $\tilde{w}_m$ instead of $f_i$ and $w_m$) contains $C$ terms in the sum, one term for each of the $C$ categories. There are 3 types of terms: type 1, the term with respect to the ground-truth category; type 2, the term with respect to the logit that has the largest value and does not belong to the ground-truth category; type 3, the other terms. We study the behavior of these terms for very small and very large $\alpha$ for “hard” and “easy” samples. First of all, it is easy to see that for $\alpha \to 0$, the magnitude of the gradient of any sample will approach 0.
For $\alpha \to +\infty$, considering “hard” samples first, the magnitude of the type 1 term ($\alpha \, (p(y_i \mid x_i) - 1) \, \tilde{w}_{y_i}$) will approach $+\infty$ as $\alpha \to +\infty$, because $p(y_i \mid x_i)$ will approach either 0 or $1/k$ with $k > 1$ ($k$ cannot be 1, otherwise $x_i$ is an “easy” sample) and $\|\tilde{w}_{y_i}\| = 1$. Similarly, the magnitude of the term of type 2 will also approach $+\infty$. For the other terms, due to the property of the exponential function that, if $a > 0$, $\lim_{\alpha \to +\infty} \alpha e^{-a\alpha} = 0$, the magnitude of any term of type 3 will decrease to 0. Therefore, for “hard” samples, as $\alpha \to +\infty$, except in some special cases (for example, two terms having exactly the same magnitude but opposite directions), the magnitude of the gradient with respect to the normalized embedding will approach infinity. Fig. 2(a) shows the magnitude of the gradient with respect to the embedding when $\alpha$ changes for three different “hard” samples derived from the network in Sec. 4.
Considering the term of type 1 for “easy” samples, since $p(y_i \mid x_i) \to 1$ and $1 - p(y_i \mid x_i)$ decays exponentially in $\alpha$, the magnitude of this term will approach 0. For the other terms, the magnitude will also approach 0. Therefore, the magnitude of the gradient will always approach 0. Fig. 2(b) shows how the magnitude of the gradient with respect to the embedding changes with $\alpha$ for three different “easy” samples.
Overall, with large $\alpha$, the magnitudes of the gradients for hard samples will become very large, while the magnitudes of the gradients for easy samples will become very small (Fig. 1(c)). With small $\alpha$, in contrast, the network will assign gradients of similar magnitudes to all the samples (Fig. 1(a)). Choosing an intermediate $\alpha$ value is a trade-off (Fig. 1(b)). Gradient assignment for different samples will greatly affect the final embedding, as discussed in the next section.
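The behavior described in this subsection can be reproduced on a toy problem. The sketch below is an illustration only, with made-up values: three unit-norm class weights in 2-D, one “easy” and one “hard” unit-norm embedding (both with ground-truth class 0), and an arbitrary grid of inverse temperatures. It evaluates the magnitude of the gradient with respect to the normalized embedding, i.e. the magnitude of the inverse temperature times the sum of (p_m - q_m) times the normalized weight of class m.

```python
import math

def softmax_alpha(logits, alpha):
    """Softmax with inverse temperature alpha, computed stably."""
    m = max(logits)
    exps = [math.exp(alpha * (z - m)) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def grad_norm(f_tilde, W_tilde, y, alpha):
    """Magnitude of alpha * sum_m (p_m - q_m) * w~_m for one sample."""
    logits = [sum(wk * fk for wk, fk in zip(w, f_tilde)) for w in W_tilde]
    p = softmax_alpha(logits, alpha)
    grad = [0.0] * len(f_tilde)
    for m, w in enumerate(W_tilde):
        coeff = alpha * (p[m] - (1.0 if m == y else 0.0))
        for k, wk in enumerate(w):
            grad[k] += coeff * wk
    return math.sqrt(sum(g * g for g in grad))

def unit(angle_deg):
    """Unit vector in 2-D at the given angle."""
    a = math.radians(angle_deg)
    return [math.cos(a), math.sin(a)]

# Illustrative setup: class weights at 90, 210 and 330 degrees.
W_tilde = [unit(90), unit(210), unit(330)]
easy = unit(80)  # label 0, close to w~_0: correctly classified
hard = unit(20)  # label 0, but closer to w~_2: misclassified
alphas = [0.01, 1, 4, 16, 64, 256]

easy_mags = [grad_norm(easy, W_tilde, 0, a) for a in alphas]
hard_mags = [grad_norm(hard, W_tilde, 0, a) for a in alphas]
for a, e, h in zip(alphas, easy_mags, hard_mags):
    print(f"alpha={a:>6}: easy={e:.3e}  hard={h:.3e}")
```

In this toy setup both magnitudes are close to 0 for the smallest value, the “hard” magnitude keeps growing as the inverse temperature increases, and the “easy” magnitude rises and then vanishes.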
To illustrate the influence of $\alpha$ on the embedding distribution, we show the embeddings obtained on the MNIST dataset in Fig. 3(a)-3(d), when using different $\alpha$ values during training. Different colors represent different digits, and each diamond corresponds to the classifier weight of the corresponding digit. We slightly shift the weights towards the origin for better visualization. The base model is LeNet (LeCun et al., 2015), and the number of nodes in the second last layer is set to 2 for visualization. In the dataset, 50,000 samples are used for training, and 10,000 different test samples are used to draw the figure. All the features and classifier weights are $\ell_2$-normalized during training.
When training with small $\alpha$ (i.e. high temperature), the network will assign similar gradients to all the samples (see Fig. 1(a)). Since the “hard” samples are more important to the classifier for improving the accuracy, updating “hard” samples and “easy” samples equally will make the training inefficient or even prevent it from converging (Fig. 3(a)). However, choosing a very large $\alpha$ (i.e. very low temperature) for training will assign large gradients to “hard” samples and very small gradients to all “easy” samples (“boundary” and “centroid”). Since the “boundary” samples will not get enough updates, they will stay near the decision boundary. Therefore, the embeddings of the samples of the same category will not be compact (Fig. 3(d)). As a consequence, a good trade-off is to use an intermediate temperature for training (see Fig. 1(b)), where “centroid” samples will get small gradients, “boundary” samples will get intermediate gradients, and “hard” samples will get large gradients. Comparing classifiers trained with different $\alpha$ values, features of the same category are more compact for the model trained with smaller $\alpha$ values (compare Fig. 3(d) with Fig. 3(b)).
We further propose a “heating-up” strategy, that consists in starting training with a low or medium temperature, such that “hard” samples will get large gradients to update and become “easy” samples soon. After that, the temperature will be increased and therefore “boundary” and “centroid” samples will also get enough gradients for updating. Therefore, the final embedding of the samples of the same category will become more compact than those of the model trained with the starting temperature only. Multiple strategies could be defined to increase the temperature during training. We tried: (i) gently increasing the temperature; (ii) training with the starting temperature until convergence and using a higher temperature to fine-tune the trained network. Both methods lead to similar performance. However, since the former method would introduce an additional parameter to control the speed of increase of the temperature, we used the latter in our experiments in Sec. 4.
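The two ways of raising the temperature can be written as schedules over the inverse temperature. This is a hedged sketch, not the released training code: the function names, the `switch_epoch` argument (standing in for "until convergence"), and the geometric interpolation used for the gradual variant are assumptions made for illustration; the default values 16 and 4 are the ones reported in the experiments of Sec. 4.

```python
def two_stage_alpha(epoch, switch_epoch, alpha_start=16.0, alpha_final=4.0):
    """Strategy (ii): train at alpha_start, then fine-tune at alpha_final.

    switch_epoch is a hypothetical stand-in for the epoch at which the
    first stage is considered converged.
    """
    return alpha_start if epoch < switch_epoch else alpha_final

def gradual_alpha(epoch, total_epochs, alpha_start=16.0, alpha_final=4.0):
    """Strategy (i): gently raise the temperature (i.e. lower alpha).

    Geometric interpolation is an assumption; the extra knob it introduces
    (the speed of the increase) is what strategy (ii) avoids.
    """
    t = min(max(epoch / max(total_epochs - 1, 1), 0.0), 1.0)
    return alpha_start * (alpha_final / alpha_start) ** t

for epoch in range(6):
    print(epoch, two_stage_alpha(epoch, switch_epoch=3), gradual_alpha(epoch, 6))
```

Lowering the inverse temperature from 16 to 4 corresponds to raising the temperature from 1/16 to 1/4.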
We now discuss the influence of the normalization. Recall that we denote the original embedding as $f_i$ and the $\ell_2$-normalized embedding as $\tilde{f}_i = f_i / \|f_i\|_2$. The Jacobian matrix of $\tilde{f}_i$ with respect to $f_i$ is:
$$\frac{\partial \tilde{f}_i}{\partial f_i} = \frac{1}{\|f_i\|_2} \left( I - \tilde{f}_i \tilde{f}_i^\top \right) \tag{5}$$
where $I$ is the identity matrix. Considering Eq. (3) and the chain rule, we have:
$$\frac{\partial L_i}{\partial f_i} = \frac{\alpha}{\|f_i\|_2} \left( I - \tilde{f}_i \tilde{f}_i^\top \right) \sum_{m=1}^{C} \left( p(m \mid x_i) - q(m \mid x_i) \right) \tilde{w}_m \tag{6}$$
Considering the norm in the denominator, the magnitude of the gradient is inversely proportional to the norm of the embedding. Therefore, even if the normalized embeddings are the same, the gradients w.r.t. the embeddings are still different for embeddings with different norms: the embedding with the larger norm will have the smaller gradient. This may seem to be a problem, and one possible solution would be to remove the norm term in the denominator. We tried this idea for the experiments in Sec. 4, but it did not give a significant improvement. The likely reason is that, since $f_i$ and $\partial L_i / \partial f_i$ are always orthogonal (Wang et al., 2017a), updating the feature along the direction of the gradient cannot change the norm of the feature much. We did observe that, when applying $\ell_2$-normalization to the feature during training, the norms of the features before normalization are very similar. On the contrary, when training without normalization, the norm of the features may have large variations (see Sec. 3.4). For numerical stability and ease of implementation, we use Eq. (6) to calculate the gradient. Therefore, the gradient analysis in Sec. 3.1 and 3.2 still holds for the features before normalization.
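Two properties used in this argument, that the gradient with respect to the unnormalized embedding is orthogonal to the embedding and that its magnitude scales inversely with the embedding norm, can be verified directly. The sketch below is illustrative only: it applies the Jacobian of the normalization, (1/norm) times (identity minus the outer product of the normalized embedding with itself), to an arbitrary made-up upstream gradient `g`.

```python
import math

def l2_normalize(f):
    """Return the unit-norm version of f together with its original norm."""
    norm = math.sqrt(sum(v * v for v in f))
    return [v / norm for v in f], norm

def backprop_through_l2norm(f, grad_normalized):
    """Map a gradient w.r.t. the normalized embedding to one w.r.t. f.

    Implements (1 / ||f||) * (I - f~ f~^T) g without forming the matrix:
    remove the component of g along f~, then divide by the norm of f.
    """
    f_tilde, norm = l2_normalize(f)
    proj = sum(a * g for a, g in zip(f_tilde, grad_normalized))
    return [(g - proj * a) / norm for a, g in zip(f_tilde, grad_normalized)]

def norm_of(v):
    return math.sqrt(sum(x * x for x in v))

f = [3.0, 4.0, 0.0]   # illustrative embedding with norm 5
g = [1.0, -2.0, 0.5]  # illustrative gradient w.r.t. the normalized embedding

grad_f = backprop_through_l2norm(f, g)
grad_2f = backprop_through_l2norm([2.0 * v for v in f], g)

dot = sum(a * b for a, b in zip(f, grad_f))
print("f . grad =", dot)
print("norm ratio =", norm_of(grad_2f) / norm_of(grad_f))
```

Doubling the norm of the embedding while keeping its direction halves the gradient magnitude, and the dot product between the embedding and its gradient is zero up to rounding.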
We empirically found that batch normalization (Ioffe & Szegedy, 2015) without the learned scale (in this paper, batch normalization always refers to batch normalization without the learned scale) works slightly better than $\ell_2$-normalization. We propose to define the batch normalized embedding as:
$$\tilde{f}_i = \frac{1}{\sqrt{d}} \, \mathrm{BN}(f_i) \tag{7}$$
where $d$ is the number of dimensions of $f_i$. Batch normalization tries to make each dimension of the embedding have zero mean and unit variance. Therefore, after batch normalization, the norm of the embedding is roughly $\sqrt{d}$, and the normalized feature has a norm close to 1, which is similar to $\ell_2$-normalization. Batch normalization may work better than $\ell_2$-normalization because, in fine-grained recognition problems, many embeddings from different categories can be very similar. Batch normalization removes the mean and rescales the embedding, thus creating more variance. For the classifier weights, $\ell_2$-normalization always gives promising results.
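The claim that a batch-normalized embedding divided by the square root of its dimensionality has a norm close to 1 is easy to check on random data. The following sketch makes several assumptions that are not in the paper: batch statistics are used as in training mode, there is no learned scale or shift, a small `eps` is added to the variance, and the batch is synthetic Gaussian data.

```python
import math
import random

def batch_norm_embedding(batch, eps=1e-5):
    """Per-dimension standardization over the batch, then scaling by 1/sqrt(d)."""
    n, d = len(batch), len(batch[0])
    means = [sum(row[k] for row in batch) / n for k in range(d)]
    variances = [sum((row[k] - means[k]) ** 2 for row in batch) / n for k in range(d)]
    scale = 1.0 / math.sqrt(d)
    return [
        [scale * (row[k] - means[k]) / math.sqrt(variances[k] + eps) for k in range(d)]
        for row in batch
    ]

random.seed(0)
# Synthetic batch: 256 embeddings of dimension 64 with non-zero mean.
batch = [[random.gauss(3.0, 2.0) for _ in range(64)] for _ in range(256)]

normalized = batch_norm_embedding(batch)
norms = [math.sqrt(sum(v * v for v in row)) for row in normalized]
mean_sq_norm = sum(n * n for n in norms) / len(norms)
print("min norm:", min(norms), "max norm:", max(norms))
print("mean squared norm:", mean_sq_norm)
```

The mean squared norm over the batch is 1 by construction (up to `eps`), while individual norms fluctuate around 1 instead of being exactly 1 as with $\ell_2$-normalization.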
It is interesting to compare the embedding of the proposed approach with the embedding from an off-the-shelf classifier (trained with unnormalized features and weights, $\alpha = 1$) (see Fig. 4(a) for MNIST). As observed in other works (Wang et al., 2017a; Ranjan et al., 2017), (i) the magnitude of the embedding can be extremely large; (ii) the embedding is not “compact”. Even if the feature is $\ell_2$-normalized (Fig. 4(b)), the embedding is still not as “compact” as the feature trained with normalization and a proper $\alpha$ (Fig. 3(b)).
The norm of the embedding tends to be large because an “easy” sample with a larger norm will produce a smaller loss (Wang et al., 2017a). To understand why the embeddings from the same category are not compact, examining the gradient is again the key. Setting $\alpha = 1$ in Eq. (3), when $\|f_i\| \to +\infty$, the magnitude of the gradient will approach 0 for an “easy” sample, while for a “hard” sample the magnitude will approach a constant, generally different from 0. Similarly to Fig. 2(a) and 2(b), the relation between the magnitude of the gradient and the norm of the feature for some “hard” and “easy” samples is shown in Fig. 4(c) and 4(d). Like boundary features trained with large $\alpha$, boundary features with a large norm will not get enough updates, which leads to an embedding that is not compact.
We conduct experiments on the following fine-grained datasets, using the training/test splits of Movshovitz-Attias et al. (2017). In all these datasets, the categories in the training and test splits do not overlap.
Cars (Car196) (Krause et al., 2013) is a fine-grained car category dataset, which contains 16,185 images of 196 car models. 8,054 images of the first 98 categories are used for training, while the 8,131 images of the other 98 categories are used for test.
Caltech-UCSD Birds-200-2011 (CUB200) (Welinder et al., 2010) is a fine-grained bird category dataset which contains 11,788 images of 200 species of birds. 5,864 images of the first 100 species are used for training, while the 5,924 images of the other 100 species are used for test.
Stanford Online Product (Product) dataset (Song et al., 2016) contains 120,053 images of 22,634 products categories. 59,551 images of 11,318 categories are used for training, while the other 60,502 images from 11,316 categories are kept for test.
In-shop Clothes Retrieval (Fashion) dataset (Liu et al., 2016) contains 54,642 images of 11,735 categories of fine-grained clothes. The dataset is split into 3 subsets. 25,882 images of 3,997 categories are used for training. The other 28,760 images of 3,985 categories are kept for test, split into a gallery set (12,612 images) and a query set (14,218 images).
For the heated-up models, $\alpha$ is decreased (equivalently, the temperature is increased) from 16 to 4, and the learning rate is reduced to 1/10 of the original learning rate. The training process usually converges within 50 training epochs, which is similar to the fastest state-of-the-art method, ProxyNCA (Movshovitz-Attias et al., 2017). Our implementation is available at: https://github.com/ColumbiaDVMM/Heated_Up_Softmax_Embedding.
Table 5: R@1 on the test set for different embedding sizes (#DIM, rows) and models (columns): SM, fixed $\alpha$ = 4, 8, 16, 32, 64, and HBN.
Following other existing works on metric learning, we evaluate the clustering quality and the retrieval performance on the images of the test set. Following Song et al. (2017), all the features are $\ell_2$-normalized before calculating the evaluation metric. The normalized features perform slightly better than the unnormalized features.
For clustering, the K-Means algorithm is run on all the embeddings of the test samples. The number of clusters is chosen to be the number of categories in the test set. Each test sample is assigned a cluster index according to the cluster it belongs to. Normalized Mutual Information (NMI) (Schütze et al., 2008) between the cluster index and the ground-truth label is used as the metric for clustering. Note that NMI is invariant to label permutation.
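The permutation invariance of NMI can be seen in a small from-scratch computation. This is a minimal sketch, not the evaluation code used for the reported numbers: it takes cluster indices as given (the K-Means step is omitted) and assumes the variant of NMI that normalizes the mutual information by the arithmetic mean of the two entropies.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (natural log) of a labeling."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def mutual_information(a, b):
    """I(A; B) = sum_{u,v} p(u, v) * log(p(u, v) / (p(u) * p(v)))."""
    n = len(a)
    joint = Counter(zip(a, b))
    count_a, count_b = Counter(a), Counter(b)
    mi = 0.0
    for (u, v), c in joint.items():
        mi += (c / n) * math.log(c * n / (count_a[u] * count_b[v]))
    return mi

def nmi(a, b):
    """Mutual information divided by the mean of the two entropies (assumed variant)."""
    ha, hb = entropy(a), entropy(b)
    if ha == 0.0 and hb == 0.0:
        return 1.0  # both labelings are a single group: treat as identical
    return 2.0 * mutual_information(a, b) / (ha + hb)

ground_truth = [0, 0, 1, 1, 2, 2]
same_clusters_renamed = [5, 5, 7, 7, 9, 9]  # same partition, different ids
print(nmi(ground_truth, same_clusters_renamed))
print(nmi([0, 0, 1, 1], [0, 1, 0, 1]))      # statistically independent labelings
```

A clustering that reproduces the ground-truth partition under different cluster ids scores 1, and two independent labelings score 0.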
For retrieval, the performance is evaluated by Recall@K, which is also a widely used metric for this problem. Given a query sample from the test set, the K samples with the smallest distance are retrieved from the rest of the test set (or from the gallery set for the Fashion dataset). If any retrieved sample is from the same category as the query sample, the recall for this query is set to 1; otherwise, it is set to 0. The reported Recall@K is the average recall over the whole test set.
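Recall@K as described here can be computed with a brute-force nearest-neighbor search. The sketch below is illustrative: the function name and the five toy points are made up, it covers only the setting where queries are matched against the rest of the same set (not the separate query/gallery protocol of the Fashion dataset), and it ranks by squared Euclidean distance, which gives the same ordering as Euclidean distance.

```python
def recall_at_k(embeddings, labels, k):
    """Fraction of queries with at least one same-label sample in their top k."""
    hits = 0
    for i, query in enumerate(embeddings):
        dists = []
        for j, other in enumerate(embeddings):
            if j == i:
                continue  # the query itself is excluded from retrieval
            sq_dist = sum((a - b) ** 2 for a, b in zip(query, other))
            dists.append((sq_dist, j))
        dists.sort()
        if any(labels[j] == labels[i] for _, j in dists[:k]):
            hits += 1
    return hits / len(embeddings)

# Toy data: two samples of class 0, two of class 1, one singleton of class 2.
points = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.2, 5.0], [5.0, 6.0]]
labels = [0, 0, 1, 2, 1]

r1 = recall_at_k(points, labels, 1)
r2 = recall_at_k(points, labels, 2)
print("Recall@1:", r1, "Recall@2:", r2)
```

In this example the class-2 sample can never be a hit because no other sample shares its label, and the class-1 sample at (5, 5) only becomes a hit once K reaches 2.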
We train a classifier on the training dataset with the softmax function and the cross-entropy loss as the baseline (SM). For the baseline classifier, the features and the weights are not normalized during training and $\alpha$ is set to 1. Four different versions of classifiers trained with the proposed methods are used for evaluation:
LN: softmax with $\ell_2$-normalized embedding, $\ell_2$-normalized weights and $\alpha = 16$.
BN: softmax with batch normalized embedding, $\ell_2$-normalized weights and $\alpha = 16$.
HLN: Heated-up model using $\alpha = 4$ to fine-tune LN.
HBN: Heated-up model using $\alpha = 4$ to fine-tune BN.
We also compare the proposed method with multiple state-of-the-art metric learning methods. Existing works use different base networks and different evaluation protocols. For a fair comparison, only the methods using GoogleNetV1 as the base network and Euclidean distance as the final evaluation metric are listed: Triplet learning with semi-hard negative mining (Schroff et al., 2015); Lifted structured loss (Song et al., 2016); Learnable Structured Clustering (Song et al., 2017); Deep Clustering Learning without spectral learning (Law et al., 2017); Deep Metric Learning with Smart Mining (Harwood et al., 2017); and ProxyNCA (Movshovitz-Attias et al., 2017). For the Fashion dataset, we compare with FashionNet (Liu et al., 2016) and Hard-Aware Deeply Cascaded Embedding (Yuan et al., 2017).
The performances of all the methods on the 4 datasets are listed in Tables 1, 2, 3 and 4 respectively. The softmax baseline already shows results comparable to many triplet loss based methods. The embeddings trained with either $\ell_2$-normalization or batch normalization improve on the performance of the softmax baseline. Since all the features are normalized in the test phase before calculating the metric, the performance gain does not come from a simple normalization of the final features. Batch normalization works slightly better than $\ell_2$-normalization. The “heated-up” models (HLN and HBN) show better performance in almost all the metrics compared to the embeddings trained with a fixed temperature.
We further study how different embedding sizes and $\alpha$ values affect the retrieval performance. The R@1 metric on the test set for the different embedding sizes and the fixed values $\alpha \in \{4, 8, 16, 32, 64\}$ is reported in Tab. 5. The performances of the feature learned by the softmax function without normalization and of the feature learned by the “heated-up” model are also given. The “heated-up” model outperforms all the other models by a significant margin in all cases. Among the models trained with a fixed $\alpha$, an intermediate value performs best.
We have discussed how the temperature parameter in the softmax function affects the distribution of the embedding in the second last layer of a deep classification model. Training with an intermediate temperature will lead to an intra-category compact and inter-category “spread-out” embedding which is beneficial for both clustering and retrieval. A "heating-up" method is also proposed to further improve the clustering and retrieval performance of the embedding by fine-tuning with a higher temperature. Our classifier based approach achieves good performance in metric learning problems with a simpler and more efficient training process than state-of-the-art methods.
This material is based upon work supported by the United States Air Force Research Laboratory (AFRL) and the Defense Advanced Research Projects Agency (DARPA) under Contract No. FA8750-16-C-0166. Any opinions, findings and conclusions or recommendations expressed in this material are solely the responsibility of the authors and do not necessarily represent the official views of AFRL, DARPA, or the U.S. Government.
International Workshop on Similarity-Based Pattern Recognition, 2015. Springer, 2015.
Deep Spectral Clustering Learning. In ICML, 2017.
LeNet-5, Convolutional Neural Networks. URL: http://yann.lecun.com/exdb/lenet, 2015.