Significance of Softmax-based Features in Comparison to Distance Metric Learning-based Features

12/29/2017 · Shota Horiguchi et al.

The extraction of useful deep features is important for many computer vision tasks. Deep features extracted from classification networks have proved to perform well in those tasks. To obtain features of greater usefulness, end-to-end distance metric learning (DML) has been applied to train the feature extractor directly. However, in these DML studies, there were no equitable comparisons between features extracted from a DML-based network and those from a softmax-based network. In this paper, by presenting objective comparisons between these two approaches under the same network architecture, we show that softmax-based features perform competitively with, or even better than, state-of-the-art DML features when the size of the dataset, that is, the number of training samples per class, is large. The results suggest that softmax-based features should be properly taken into account when evaluating the performance of deep features.


1 Introduction

Recent developments in deep convolutional neural networks have made it possible to classify many classes of images with high accuracy. It has also been shown that such classification networks work well as feature extractors. Features extracted from classification networks show excellent performance in image classification [1], detection, and retrieval [2][3], even when they have been trained to classify the 1000 classes of the ImageNet dataset [4]. It has also been shown that fine-tuning for target domains further improves the features' performance [5][6].

[Figure panels: (a) Siamese; (b) Softmax; (c) Softmax + L2 normalization]
Fig. 5: Depiction of the MNIST dataset. (a) Two-dimensional features obtained by a siamese network. (b) Two-dimensional features extracted from a softmax-based classifier; these features are well separated by angle but not by Euclidean norm. (c) Three-dimensional features extracted from a softmax-based classifier, normalized to unit L2 norm and depicted in an azimuth–elevation coordinate system; the three-dimensional features are well separated by their classes.

On the other hand, distance metric learning (DML) approaches have recently attracted considerable attention. These approaches learn a feature space in which distance corresponds to class similarity, rather than obtaining such a space as a byproduct of a classification network. End-to-end distance metric learning, a typical approach that constructs the feature extractor directly from convolutional neural networks, has been the focus of numerous studies [7, 8, 9, 10, 11].

However, there have been no experiments comparing softmax-based features with DML-based features under the same network architecture or with adequate fine-tuning. An analysis providing a true comparison of DML features and softmax-based features is long overdue.

Fig. 5 depicts the feature vectors extracted from a softmax-based classification network and from a metric learning-based network. We used the LeNet architecture for both networks and trained them on the MNIST dataset [12]. For DML, we used the contrastive loss function [13] to map images into a two-dimensional space. For softmax-based classification, we added a two- or three-dimensional fully connected layer before the output layer for visualization. DML succeeds in learning a feature embedding (Fig. 5(a)). Softmax-based classification networks can achieve a very similar result: images are located near one another if they belong to the same class and far apart otherwise (Fig. 5(b), Fig. 5(c)).
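For reference, the contrastive loss [13] used for this embedding can be written compactly. The following is a minimal PyTorch sketch under our reading of [13] (PyTorch stands in for the actual implementation, and the margin value is an assumption):

    import torch
    import torch.nn.functional as F

    def contrastive_loss(f1, f2, same_class, margin=1.0):
        # f1, f2: (batch, dim) embeddings of an image pair
        # same_class: (batch,) float tensor; 1 if the pair shares a label, else 0
        d = F.pairwise_distance(f1, f2)                     # Euclidean distance per pair
        pos = same_class * d.pow(2)                         # pull same-class pairs together
        neg = (1 - same_class) * F.relu(margin - d).pow(2)  # push others apart up to the margin
        return 0.5 * (pos + neg).mean()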

Our contributions in this paper are as follows:

  • We show methods to exploit the capability of deep features extracted from softmax-based networks, namely normalization and proper dimensionality reduction. These methods are not technically novel, but they are necessary for a fair comparison between the image representations.

  • We demonstrate that deep features extracted from softmax-based classification networks achieve competitive or better results on clustering and retrieval tasks compared with those from state-of-the-art DML-based networks [9, 10, 11] on the Caltech UCSD Birds 200-2011 dataset and the Stanford Cars 196 dataset.

  • We show how the clustering and retrieval performance of softmax-based features and DML features changes with the size of the dataset. DML features show competitive or better performance on the Stanford Online Products dataset, which has a very small number of samples per class.

To keep the network architecture consistent across methods, we restrict it to GoogLeNet [14], which has been used in the state-of-the-art DML studies [9, 10, 11].

2 Background

2.1 Previous Work

2.1.1 Softmax-Based Classification and Repurposing of the Classifier as a Feature Extractor

Convolutional neural networks have demonstrated great potential for highly accurate image recognition [15][16][14][17]. It has been shown that features extracted from classification networks can be repurposed as a good feature representation for novel tasks [1][2][18] even if the network was trained on ImageNet [4]. For obtaining better feature representations, fine-tuning is also effective [6].

2.1.2 Deep Distance Metric Learning

Distance metric learning (DML), which learns a distance metric, has been widely studied [19][20][21][18]. Recent studies have focused on end-to-end deep distance metric learning [7][8][9][10][11]. However, in most of these studies, end-to-end DML has not been compared with features extracted from classification networks under architectures and conditions that enable a true comparison of performance.

Bell and Bala [7] compared classification networks and siamese networks, but they used coarse class labels for the classification networks and fine labels for the siamese networks; thus, it remains unclear whether siamese networks learn feature embeddings better than classification networks. Schroff et al. [8] used triplet loss for deep metric learning in their FaceNet, which showed state-of-the-art performance at the time, but their network was deeper than that of the previous method (Taigman et al. [22]); thus, triplet loss might not have been the only reason for the improvement, and its contribution remains uncertain. Song et al. [9] used lifted structured feature embedding; however, they compared their method only with a softmax-based classification network pretrained on ImageNet (Russakovsky et al. [4]) and not with a fine-tuned network. Sohn [10] and Song et al. [11] compared their methods only with lifted structured feature embedding; thus, comparisons with softmax-based features were again not provided.

2.2 Differences Between Softmax-based Classification and Metric Learning

Fig. 6: Illustration of the learning processes of a softmax-based classification network and a siamese-based DML network. For softmax, the gradient is defined by the distance between a sample and a fixed one-hot vector; for siamese networks, it is defined by the distance between samples.

For classification, the softmax function (Eq. 1) is typically used:

$$p_i = \frac{\exp(u_i)}{\sum_{j=1}^{C} \exp(u_j)} \qquad (1)$$

where $p_i$ denotes the probability that the vector $\boldsymbol{u}$ belongs to class $i$ and $C$ is the number of classes. The loss of the softmax function is defined by the cross-entropy

$$E = -\sum_{i=1}^{C} t_i \log p_i \qquad (2)$$

where $\boldsymbol{t}$ is a one-hot encoding of the correct class of $\boldsymbol{u}$. To minimize the cross-entropy loss, networks are trained to make the output vector $\boldsymbol{p}$ close to its corresponding one-hot vector. It is important to note that the target vectors (the correct outputs of the network) are fixed during the entire training (Fig. 6).
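As a concrete instance of Eqs. 1 and 2, here is a minimal NumPy sketch (the variable names are ours):

    import numpy as np

    def softmax(u):
        e = np.exp(u - u.max())   # subtract the max for numerical stability
        return e / e.sum()

    def cross_entropy(u, t):
        # t is the fixed one-hot target; it never changes during training
        return -np.sum(t * np.log(softmax(u)))

    u = np.array([2.0, 1.0, -1.0])   # pre-softmax network output
    t = np.array([1.0, 0.0, 0.0])    # one-hot encoding of the correct class
    print(cross_entropy(u, t))       # ~0.349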

On the other hand, DML methods use distances between samples. They do not use the values of the labels; rather, they only ascertain whether the labels of the target samples are the same. For example, contrastive loss [13] considers the distance between a pair of samples. Recent studies [8][9][11][10] use pairwise distances between three or more images at the same time for fast convergence and efficient calculation; a sketch of one such loss follows. However, these methods have some drawbacks. For DML, in contrast to optimization of the softmax cross-entropy loss, the optimization targets are not always consistent during training, even if all possible distances within the mini-batch are considered. Thus, DML optimization converges slowly and is less stable.
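As an illustration, here is a hedged PyTorch sketch of the triplet loss of [8], one such multi-sample loss; the margin is an assumed hyperparameter. Note that the target (which samples should be closer) changes with the composition of each mini-batch:

    import torch
    import torch.nn.functional as F

    def triplet_loss(anchor, positive, negative, margin=0.2):
        # anchor/positive share a class; negative comes from a different class
        d_pos = F.pairwise_distance(anchor, positive)
        d_neg = F.pairwise_distance(anchor, negative)
        # unlike the fixed one-hot targets of softmax, this objective depends
        # on which samples happen to be paired in the current mini-batch
        return F.relu(d_pos - d_neg + margin).mean()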

(a) GoogLeNet (dimensionality is reduced to n by PCA)
(b) GoogLeNet with dimensionality reduction by a fully connected layer just before the output layer (FCR1)
(c) GoogLeNet with dimensionality reduction by a fully connected layer followed by a dropout layer (FCR2)
Fig. 10: GoogLeNet [14] architecture we use in this paper. We extracted the features of the red-colored layers. For (a), we applied PCA to reduce the number of feature dimensions. For (b) and (c), the dimensionality is reduced by the fc_reduction layer.
Dataset    Train             Test              Total
CUB [23]   5,864 / 100       5,924 / 100       11,788 / 200
CAR [24]   8,054 / 98        8,131 / 98        16,185 / 196
OP [9]     59,551 / 11,318   60,502 / 11,316   120,053 / 22,634
TABLE I: Properties of the datasets used in our experiments. Each cell shows the number of images / the number of classes.
Method                 dim   NMI     R@1     R@2     R@4     R@8
Lifted struct [9]      64    56.5    43.6    56.6    68.6    79.6
Lifted struct (rerun)  64    (56.0)  (42.7)  (55.0)  (67.2)  (78.1)
N-pair loss [10]       64    57.2    45.4    58.4    69.5    79.5
Clustering loss [11]   64    59.2    48.2    61.4    71.8    81.9
PCA + L2               64    60.8    51.1    64.0    75.3    84.0
FCR1 + L2              64    59.1    49.0    61.1    72.7    82.3
FCR2 + L2              64    57.4    48.0    60.3    72.2    81.6
TABLE II: CUB: NMI (clustering) and Recall@K (retrieval, R@K) scores for the test set of the Caltech UCSD Birds 200-2011 (CUB) dataset. Parenthesized values are our rerun of the publicly available code [9].
Method                 dim   NMI     R@1     R@2     R@4     R@8
Lifted struct [9]      64    56.9    53.0    65.7    76.0    84.0
Lifted struct (rerun)  64    (57.1)  (50.5)  (63.6)  (74.9)  (83.6)
N-pair loss [10]       64    57.8    53.9    66.8    77.8    86.4
Clustering loss [11]   64    59.0    58.1    70.6    80.3    87.8
PCA + L2               64    58.3    69.4    80.0    87.2    92.4
FCR1 + L2              64    58.7    66.7    77.7    85.2    90.8
FCR2 + L2              64    60.4    67.9    78.4    86.1    91.3
TABLE III: CAR: NMI (clustering) and Recall@K (retrieval, R@K) scores for the test set of the Stanford Cars 196 (CAR) dataset. Parenthesized values are our rerun of the publicly available code [9].
Method                 dim   NMI     R@1     R@10    R@100
Lifted struct [9]      64    88.7    62.5    80.8    91.9
Lifted struct (rerun)  64    (87.7)  (61.0)  (79.9)  (91.5)
N-pair loss [10]       64    89.4    66.4    83.2    93.0
Clustering loss [11]   64    89.5    67.0    83.7    93.2
PCA + L2               64    87.5    62.4    78.9    89.7
FCR1 + L2              64    87.7    61.3    78.6    90.1
FCR2 + L2              64    87.9    62.5    79.8    90.8
TABLE IV: OP: NMI (clustering) and Recall@K (retrieval, R@K) scores for the test set of the Stanford Online Products (OP) dataset. Parenthesized values are our rerun of the publicly available code [9].

3 Methods

3.1 Dimensionality Reduction Layer

One strength of DML with fine-tuning is that the output dimensionality can be chosen freely via the final fully connected layer. When using features from a mid-layer of a softmax classification network, on the other hand, the dimensionality of the features is fixed. Some existing methods [6] use PCA or discriminative dimensionality reduction to reduce the number of feature dimensions. In our experiments, we evaluated three methods for changing the feature dimensionality. Following conventional PCA approaches, we extracted features from the 1024-dimensional pool5 layer of GoogLeNet [14] (Fig. 10(a)) and applied PCA to reduce the dimensionality. In a contrasting approach, we made use of a fully connected layer: we added a fully connected layer with the required number of neurons just before the output layer (FCR1, Fig. 10(b)). We also investigated a third approach in which the added fully connected layer is followed by a dropout layer (FCR2, Fig. 10(c)).
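As a sketch of the PCA variant, assuming pool5 activations have already been dumped to arrays (the file names and the use of scikit-learn are our assumptions):

    import numpy as np
    from sklearn.decomposition import PCA

    # (num_images, 1024) pool5 activations of GoogLeNet; hypothetical files
    pool5_train = np.load("pool5_train.npy")
    pool5_test = np.load("pool5_test.npy")

    pca = PCA(n_components=64)                    # target dimensionality, e.g. 64
    feat_train = pca.fit_transform(pool5_train)   # fit on training features only
    feat_test = pca.transform(pool5_test)         # apply the same projection to test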

3.2 Normalization

In this study, all the features extracted from the classification networks are taken from the last layer before the output layer. The outputs are normalized by the softmax function and then evaluated by the cross-entropy loss. Assume that the pre-softmax output vector is $\boldsymbol{u} = W\boldsymbol{x}$, where $\boldsymbol{x}$ is the feature we extract and $W$ denotes the linear projection matrix from the layer before the output layer to the output layer. For an arbitrary positive constant $a$, $a\boldsymbol{u}$ yields the same class ranking after the softmax function is applied, so the classification objective does not constrain the scale of $\boldsymbol{u}$. The vector $\boldsymbol{x}$ thus has an ambiguity in its scale, and so does its linear transform $\boldsymbol{u}$; therefore, the features should be normalized. As Fig. 5(b) clearly indicates, the distance between features extracted from a softmax-based classifier should be evaluated by cosine similarity, not by the Euclidean distance.

Some studies used L2 normalization for deep features extracted from softmax-based classification networks [22][6], whereas many recent studies have used the features without any normalization [15][9, 25]. In this study, we also validate the efficacy of normalizing deep features.
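The normalization itself is a one-liner; a NumPy sketch:

    import numpy as np

    def l2_normalize(x, eps=1e-10):
        # project each row (feature) onto the unit hypersphere
        return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

    # for unit vectors, ||a - b||^2 = 2 - 2 a.b, so Euclidean distance and
    # cosine similarity induce the same ranking after normalization
    feats = l2_normalize(np.random.randn(100, 64))
    sims = feats @ feats.T   # pairwise cosine similarities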

Fig. 11: CUB: Comparison between softmax-based features and lifted structured feature embedding [9] on NMI (clustering) and Recall@K (retrieval) scores for the test set of the Caltech UCSD Birds 200-2011 (CUB) dataset. The feature dimension used in the retrieval experiments is 64.

Fig. 12: CAR: Comparison between softmax-based features and lifted structured feature embedding [9] on NMI (clustering) and Recall@K (retrieval) scores for the test set of the Stanford Cars 196 (CAR) dataset. The feature dimension used in the retrieval experiments is 64.
Fig. 13: OP: Comparison between softmax-based features and lifted structured feature embedding [9] on NMI (clustering) and Recall@K (retrieval) scores for the test set of the Online Products (OP) dataset. The feature dimension used in the retrieval experiments is 64.

4 Experiments

In this section, we compare the deep features extracted from classification networks with those from state-of-the-art DML-based networks [9][10][11]. The GoogLeNet architecture [14] was used for all the methods; thus, the numbers of parameters are the same for the DML-based and softmax-based networks. All the networks were fine-tuned from weights pretrained on ImageNet [4]. We used the Caffe framework [26] for the implementation.

4.1 Comparisons between softmax-based features and DML-based features

Here, we give our evaluation of clustering and retrieval scores for the state-of-the-art DML methods [9][10][11] and for the softmax classification networks. We used the Caltech UCSD Birds 200-2011 (CUB) dataset [23], the Stanford Cars 196 (CAR) dataset [24], and the Stanford Online Products (OP) dataset [9]. For CUB and CAR, we used the first half of the dataset classes for training and the rest for testing. For OP, we used the training–testing class split provided. The dataset properties are shown in Table I. We emphasize that the class sets used for training and testing were completely different.

For the clustering evaluation, we applied k-means clustering 100 times and calculated the NMI (normalized mutual information) score [27]; the value of k was set to the number of classes in the test set. For the retrieval evaluation, we calculated Recall@K [28].
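Both metrics are straightforward to compute; a scikit-learn/NumPy sketch under the assumption that features are already L2-normalized (the function names are ours):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import normalized_mutual_info_score

    def nmi_score(feats, labels, n_classes, seed=0):
        # one k-means run; in the experiments k-means is repeated 100 times
        pred = KMeans(n_clusters=n_classes, random_state=seed).fit_predict(feats)
        return normalized_mutual_info_score(labels, pred)

    def recall_at_k(feats, labels, k):
        # labels: (N,) integer array of class labels
        sims = feats @ feats.T                    # cosine similarity (unit-norm features)
        np.fill_diagonal(sims, -np.inf)           # exclude the query itself
        topk = np.argsort(-sims, axis=1)[:, :k]   # k nearest neighbors per query
        hits = (labels[topk] == labels[:, None]).any(axis=1)
        return hits.mean()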

In Table II and Table III, we show comparisons of clustering and retrieval performance using NMI and Recall@K scores, respectively, for the CUB and CAR datasets. We compared the three softmax-based features with lifted structured feature embedding [9], N-pair loss [10], and clustering loss [11]. The results of the DML methods are quoted from [11]. For the lifted structure [9], the results in parentheses are the scores we obtained by running the publicly available code ourselves, which we confirmed were almost the same as those in [11]. As Table II and Table III show, the softmax-based features outperformed the DML features; all three performed well on both datasets.

On the OP dataset (Table IV), in contrast to the CUB and CAR datasets, the DML features outperform the softmax-based features. We give a detailed analysis in the following section.

Fig. 14: CUB: NMI (clustering) and Recall@K (retrieval) scores for the test set of the Caltech UCSD Birds 200-2011 dataset under different dataset sizes. The feature dimensionality is fixed at 1024.

Fig. 15: CAR: NMI (clustering) and Recall@K (retrieval) scores for the test set of the Stanford Cars 196 dataset under different dataset sizes. The feature dimensionality is fixed at 256.

4.2 Detailed comparisons between softmax-based features and lifted structure embedding features

We made detailed comparisons between softmax-based features and lifted structured feature embedding [9] while changing the feature dimensionality and the size of the dataset. We conducted these experiments using the publicly available code for lifted structured feature embedding [9].

First, we show how the performance varies with the feature dimensionality. We changed the dimensionality of the softmax-based features via PCA, FCR1, and FCR2, and investigated how the clustering and retrieval performance varied. We compared them against lifted structured feature embedding of the same dimensionality.

For training, we multiplied the learning rates of the changed layers (the output layers for all models and the added fully connected layer for FCR1 and FCR2) by 10. The batch size was set to 128, and the maximum number of iterations was set to 20,000, which was large enough for the three datasets to converge, as mentioned in [11]. These training strategies were exactly the same as those used in [9].
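We used Caffe, but the per-layer learning-rate setting is framework-agnostic; here is a PyTorch-style sketch of the equivalent optimizer configuration (the stand-in modules and the base learning rate are assumptions, not values from the text):

    import torch
    import torch.nn as nn

    backbone = nn.Linear(1024, 1024)   # stand-in for the pretrained GoogLeNet layers
    new_fc = nn.Linear(1024, 64)       # stand-in for the changed/added layers

    base_lr = 1e-4                     # assumed; not stated in the text
    optimizer = torch.optim.SGD(
        [
            {"params": backbone.parameters(), "lr": base_lr},
            {"params": new_fc.parameters(), "lr": base_lr * 10},  # 10x multiplier
        ],
        momentum=0.9,
    )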

We show the results for the CUB and CAR datasets under varying dimensionalities in Fig. 11 and Fig. 12, respectively. The deep features extracted from the softmax-based classification networks outperformed the lifted structured feature embedding in both clustering (NMI) and retrieval (Recall@K).

For clustering performance measured by NMI, all of the softmax models (PCA, FCR1, and FCR2) showed better scores than the lifted structured feature embedding. Regarding normalization, softmax-based features with L2 normalization showed better performance than those without normalization.

The NMI scores of PCA, FCR1, and FCR2 increased monotonically with the feature dimensionality on the CUB dataset (Fig. 11). On the CAR dataset (Fig. 12), on the other hand, the NMI scores of FCR2 and the lifted structured embedding decreased beyond 256 dimensions, and those of PCA and FCR1 saturated above 256 dimensions. This result shows that 1024 dimensions is too large to represent the image classes of the CAR dataset. It also implies that the feature dimensionality should be chosen carefully for the target data in order to achieve the best performance.

For retrieval performance measured by Recall@K, the softmax-based features also outperformed the lifted structured feature embedding. Regarding L2 normalization, features with normalization again showed better scores than those without it.

Fig. 13 shows the clustering and retrieval performance, measured by NMI and Recall@K respectively, for the Online Products dataset. In contrast to the CUB and CAR datasets, the softmax-based features with L2 normalization and the lifted structured embedding showed almost the same performance in clustering and retrieval. As shown in Table I, the OP dataset is very different from the CUB and CAR datasets in terms of the number of classes and the number of samples per class: it has about 22k classes and 120k samples. The number of samples per class in the OP dataset is 5.3 on average, which is far smaller than in the CUB and CAR datasets.

4.3 The effect of the dataset scales

From the results on these three datasets, we conjecture that the dataset size, that is, the number of samples per class, has a considerable influence on softmax-based features. Hence, we changed the size of the datasets by subsampling the images of the CUB and CAR datasets per class and ran the experiments again. We constructed seven datasets of different sizes, containing 5, 10, 20, 40, 60, 80, and 100% of the whole dataset, respectively; 5% corresponds to approximately 3 and 4 images per class for the CUB and CAR datasets, respectively. As shown in Fig. 14 and Fig. 15, the differences between the scores for softmax and DML were small when the training dataset was small, and the gap grew as the dataset size increased. The softmax-based classifier was thus strongly influenced by the size of the dataset.
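The subsampled training sets can be built with a simple per-class draw; a sketch (the names are ours):

    import random
    from collections import defaultdict

    def subsample_per_class(labels, fraction, seed=0):
        # group image indices by class, then keep `fraction` of each class
        rng = random.Random(seed)
        by_class = defaultdict(list)
        for i, y in enumerate(labels):
            by_class[y].append(i)
        kept = []
        for idxs in by_class.values():
            n = max(1, round(len(idxs) * fraction))  # keep at least one image
            kept.extend(rng.sample(idxs, n))
        return sorted(kept)   # indices of the retained training images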

5 Conclusion

Because there was no equitable comparison in previous studies, we compared softmax-based features and state-of-the-art DML features using a design that enables these methods to objectively demonstrate their true performance. Our results showed that features extracted from softmax-based classifiers performed better than those from state-of-the-art DML methods [9][10][11] on fine-grained classification, clustering, and retrieval tasks when the size of the training dataset (samples per class) is large. The results also showed that the size of the dataset strongly influences the performance of softmax-based features. When the dataset was small, DML showed better or competitive performance. DML methods also have an advantage when the number of classes is very large and a softmax-based classifier is no longer applicable. In DML studies, softmax-based features have rarely been compared fairly with DML-based features under the same network architecture or with adequate fine-tuning. This paper revealed that softmax-based features remain strong baselines. The results suggest that fine-tuned softmax-based features should be taken into account when evaluating the performance of deep features.

References

  • [1] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, T. Darrell, DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition, ICML, pp.647–655, 2014
  • [2] A. S. Razavian, H. Azizpour, J. Sullivan, S. Carlsson, CNN Features Off-the-Shelf: An Astounding Baseline for Recognition, CVPR Workshops, pp.512–519, 2014
  • [3] Y. Liu, Y. Guo, S. Wu, M. S. Lew, DeepIndex for Accurate and Efficient Image Retrieval, ICMR, pp.43–50, 2015

  • [4] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, Fei-Fei Li, ImageNet Large Scale Visual Recognition Challenge, IJCV, Vol. 115, No. 3, pp.211-252, 2015
  • [5] J. Wan, D. Wang, S. H. Hoi, P. Wu, J. Zhu, Y. Zhang, J. Li, Deep Learning for Content-Based Image Retrieval: A Comprehensive Study, ACM Multimedia, pp.157–166, 2014
  • [6] A. Babenko, A. Slesarev, A. Chigorin, V. Lempitsky, Neural Codes for Image Retrieval, ECCV, pp.584–599, 2014
  • [7] S. Bell, K. Bala, Learning Visual Similarity for Product Design with Convolutional Neural Networks, SIGGRAPH, 2015
  • [8] F. Schroff, D. Kalenichenko, J. Philbin, FaceNet: A Unified Embedding for Face Recognition and Clustering, CVPR, pp.815–823, 2015

  • [9] H. O. Song, Y. Xiang, S. Jegelka, S. Savarese, Deep Metric Learning via Lifted Structured Feature Embedding, CVPR, pp.4004–4012, 2016
  • [10] K. Sohn, Improved Deep Metric Learning with Multi-class N-pair Loss Objective, NIPS, pp.1857–1865, 2016
  • [11] H. O. Song, S. Jegelka, V. Rathod, K. Murphy, Deep Metric Learning via Facility Location, CVPR, pp.2206–2214, 2017
  • [12] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE, Vol. 86, No. 11, pp.2278–2324, 1998
  • [13] R. Hadsell, S. Chopra, Y. LeCun, Dimensionality reduction by learning an invariant mapping, CVPR, pp.1735–1742, 2006
  • [14] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich, Going Deeper with Convolutions, CVPR, pp.1–9, 2015
  • [15] A. Krizhevsky, I. Sutskever, G. E. Hinton, ImageNet Classification with Deep Convolutional Neural Networks, NIPS, pp.1097–1105, 2012
  • [16] K. Simonyan, A. Zisserman, Very Deep Convolutional Networks for Large-Scale Image Recognition, ICLR, 2015
  • [17] K. He, X. Zhang, S. Ren, J. Sun, Deep Residual Learning for Image Recognition, CVPR, pp. 770–778, 2016
  • [18] Q. Qian, R. Jing, S. Zhu, Y. Lin, Fine-grained visual categorization via multi-stage metric learning, CVPR, pp.3716–3724, 2015
  • [19] J. Bromley, I. Guyon, Y. LeCun, E. Säckinger, R. Shah, Signature Verification using a "Siamese" Time Delay Neural Network, NIPS, pp.737–744, 1994
  • [20] S. Chopra, R. Hadsell, Y. LeCun, Learning a similarity metric discriminatively, with application to face verification, CVPR, pp. 539–546, 2005
  • [21] G. Chechik, V. Sharma, U. Shalit, S. Bengio, Large Scale Online Learning of Image Similarity Through Ranking, JMLR, pp.1109–1135, 2010
  • [22] Y. Taigman, M. Yang, M. Ranzato, L. Wolf, DeepFace: Closing the Gap to Human-Level Performance in Face Verification, CVPR, pp. 1701-1708, 2014
  • [23] C. Wah, S. Branson, P. Welinder, P. Perona, S. Belongie, The Caltech-UCSD Birds-200-2011 Dataset, California Institute of Technology, CNS-TR-2011-001, 2011
  • [24] J. Krause, M. Stark, J. Deng, Fei-Fei Li, 3D Object Representations for Fine-Grained Categorization, 4th International IEEE Workshop on 3D Representation and Recognition, pp.554-561, 2013
  • [25] L. Wei, Q. Huang, D. Ceylan, E. Vouga, H. Li, Dense Human Body Correspondences Using Convolutional Networks, CVPR, pp.1544–1553, 2016
  • [26] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, T. Darrell, Caffe: Convolutional Architecture for Fast Feature Embedding, arXiv preprint arXiv:1408.5093, 2014
  • [27] C. D. Manning, P. Raghavan, H. Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008
  • [28] H. Jegou, M. Douze, C. Schmid, Product Quantization for Nearest Neighbor Search, IEEE Trans. PAMI, Vol. 33, No. 1, pp.117–128, 2011