Recent developments in deep convolutional neural networks have made it possible to classify many classes of images with high accuracy. It has also been shown that such classification networks work well as feature extractors. Features extracted from classification networks show excellent performance in image classification, detection, and retrieval, even when the networks have been trained only to classify the 1000 classes of the ImageNet dataset. It has also been shown that fine-tuning for target domains further improves the features’ performance.
On the other hand, distance metric learning (DML) approaches have recently attracted considerable attention. These obtain a feature space in which distance corresponds to class similarity; it is not a byproduct of the classification network. End-to-end distance metric learning is a typical approach to constructing a feature extractor using convolutional neural networks and has been the focus of numerous studies [7, 8, 9, 10, 11].
However, there have been no experiments comparing softmax-based features with DML-based features under the same network architecture or with adequate fine-tuning. An analysis providing a true comparison of DML features and softmax-based features is long overdue.
The figure depicts the feature vectors extracted from a softmax-based classification network and a metric learning-based network. We used the LeNet architecture for both networks and trained them on the MNIST dataset. For DML, we used the contrastive loss function to map images into a two-dimensional space. For softmax-based classification, we added a two- or three-dimensional fully connected layer before the output layer for visualization. DML succeeds in learning a feature embedding (Fig. (a)). Softmax-based classification networks can also achieve a result very similar to that obtained by DML: images are located near one another if they belong to the same class and far apart otherwise (Fig. (b), Fig. (c)).
Our contributions in this paper are as follows:
- We show methods to exploit the ability of deep features extracted from softmax-based networks, such as normalization and proper dimensionality reduction. They are technically not novel, but they must be used for a fair comparison between the image representations.
- We demonstrate that deep features extracted from softmax-based classification networks show competitive or better results on clustering and retrieval tasks compared with those from state-of-the-art DML-based networks [9, 10, 11] on the Caltech UCSD Birds 200-2011 dataset and the Stanford Cars 196 dataset.
- We show how the clustering and retrieval performance of softmax-based features and DML features changes with the size of the dataset. DML features show competitive or better performance on the Stanford Online Products dataset, which has a very small number of samples per class.
2.1 Previous Work
2.1.1 Softmax-Based Classification and Repurposing of the Classifier as a Feature Extractor
Convolutional neural networks have demonstrated great potential for highly accurate image recognition. It has been shown that features extracted from classification networks can be repurposed as a good feature representation for novel tasks, even if the network was trained on ImageNet. Fine-tuning is also effective for obtaining better feature representations.
2.1.2 Deep Distance Metric Learning
Distance metric learning (DML), which learns a distance metric, has been widely studied. Recent studies have focused on end-to-end deep distance metric learning. However, in most studies, comparisons of end-to-end DML with features extracted from classification networks have not been performed using architectures and conditions that enable a true comparison of performance.
Bell and Bala compared classification networks and siamese networks, but they used coarse class labels for classification networks and fine labels for siamese networks; thus, it was left unclear whether siamese networks are better for feature-embedding learning than classification networks. Schroff et al. used triplet loss for deep metric learning in their FaceNet, which showed state-of-the-art performance at the time, but their network was deeper than that of the previous method (Taigman et al.); thus, triplet loss might not have been the only reason for the performance improvement, and the contribution from adopting triplet loss remains uncertain. Song et al. used lifted structured feature embedding; however, they only compared their method with a softmax-based classification network pretrained on ImageNet (Russakovsky et al.) and did not compare it with a fine-tuned network. Sohn, and Song et al., in turn compared their methods with lifted structured feature embedding; thus, comparisons with softmax-based features have not been shown.
2.2 Differences Between Softmax-based Classification and Metric Learning
For classification, the softmax function (Eq. 1) is typically used:

$$p_c = \frac{\exp(u_c)}{\sum_{i=1}^{C} \exp(u_i)} \tag{1}$$

where $p_c$ denotes the probability that the vector $\mathbf{u}$ belongs to the class $c$ among $C$ classes. The loss of the softmax function is defined by the cross-entropy

$$E = -\sum_{c=1}^{C} q_c \log p_c \tag{2}$$

where $\mathbf{q}$ is a one-hot encoding of the correct class of $\mathbf{u}$. To minimize the cross-entropy loss, networks are trained to make the output vector $\mathbf{p}$ close to its corresponding one-hot vector. It is important to note that the target vectors (the correct outputs of the network) are fixed during the entire training (Fig. 6).
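As a minimal illustration of the softmax function and the cross-entropy loss described above, the following Python sketch evaluates both for a single sample. The three-class scores and the function names are hypothetical and chosen only for illustration; this is not the training code used in the experiments.

```python
import math

def softmax(logits):
    """Map raw class scores to probabilities that sum to 1."""
    m = max(logits)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(u - m) for u in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, label):
    """Cross-entropy against a one-hot target for class index `label`."""
    # With a one-hot target, only the correct-class term of the sum survives.
    return -math.log(probs[label])

logits = [2.0, 1.0, 0.1]  # hypothetical network outputs for three classes
probs = softmax(logits)
loss = cross_entropy(probs, label=0)
```

Because the one-hot target for a given class never changes, the loss for a sample depends only on that sample's own output, which is the sense in which the optimization targets are fixed during training.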
On the other hand, DML methods use distance between samples. They do not use the values of the labels; rather, they ascertain whether the labels are the same between target samples. For example, contrastive loss considers the distance between a pair of samples. Recent studies use pairwise distances between three or more images at the same time for fast convergence and efficient calculation. However, these methods have some drawbacks. For DML, in contrast to optimization of the softmax cross-entropy loss, the optimization targets are not always consistent during training even if all possible distances within the mini-batch are considered. Thus, the DML optimization converges slowly and is not stable.
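To make the pairwise nature of DML concrete, here is a small Python sketch of a contrastive loss for one pair of embeddings, assuming the commonly used margin-based form. The margin value, the example points, and the function names are illustrative assumptions, not settings taken from the experiments.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def contrastive_loss(a, b, same_class, margin=1.0):
    """Contrastive loss for a single pair.

    Only whether the two labels match is used, never the label values.
    `margin` is an illustrative hyperparameter (assumption).
    """
    d = euclidean(a, b)
    if same_class:
        return 0.5 * d ** 2                  # pull matching pairs together
    return 0.5 * max(0.0, margin - d) ** 2   # push other pairs beyond the margin

anchor, near, far = [0.0, 0.0], [0.3, 0.4], [3.0, 4.0]  # hypothetical 2-D embeddings
loss_same = contrastive_loss(anchor, near, same_class=True)
loss_diff_near = contrastive_loss(anchor, near, same_class=False)
loss_diff_far = contrastive_loss(anchor, far, same_class=False)
```

Note that the loss for `anchor` depends on where its partner currently sits in the embedding space, so the effective target moves as the network is updated, unlike the fixed one-hot targets of the softmax cross-entropy loss.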
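Table II: Clustering (NMI) and retrieval (Recall@K) scores on the CUB dataset.

| Method | Dim. | NMI | R@1 | R@2 | R@4 | R@8 |
|---|---|---|---|---|---|---|
| Lifted struct | 64 | 56.5 | 43.6 | 56.6 | 68.6 | 79.6 |
| N-pair loss | 64 | 57.2 | 45.4 | 58.4 | 69.5 | 79.5 |
| Clustering loss | 64 | 59.2 | 48.2 | 61.4 | 71.8 | 81.9 |
| PCA + L2 | 64 | 60.8 | 51.1 | 64.0 | 75.3 | 84.0 |
| FCR1 + L2 | 64 | 59.1 | 49.0 | 61.1 | 72.7 | 82.3 |
| FCR2 + L2 | 64 | 57.4 | 48.0 | 60.3 | 72.2 | 81.6 |

Table III: Clustering (NMI) and retrieval (Recall@K) scores on the CAR dataset.

| Method | Dim. | NMI | R@1 | R@2 | R@4 | R@8 |
|---|---|---|---|---|---|---|
| Lifted struct | 64 | 56.9 | 53.0 | 65.7 | 76.0 | 84.0 |
| N-pair loss | 64 | 57.8 | 53.9 | 66.8 | 77.8 | 86.4 |
| Clustering loss | 64 | 59.0 | 58.1 | 70.6 | 80.3 | 87.8 |
| PCA + L2 | 64 | 58.3 | 69.4 | 80.0 | 87.2 | 92.4 |
| FCR1 + L2 | 64 | 58.7 | 66.7 | 77.7 | 85.2 | 90.8 |
| FCR2 + L2 | 64 | 60.4 | 67.9 | 78.4 | 86.1 | 91.3 |

Table IV: Clustering (NMI) and retrieval (Recall@K) scores on the OP dataset.

| Method | Dim. | NMI | R@1 | R@10 | R@100 |
|---|---|---|---|---|---|
| Lifted struct | 64 | 88.7 | 62.5 | 80.8 | 91.9 |
| N-pair loss | 64 | 89.4 | 66.4 | 83.2 | 93.0 |
| Clustering loss | 64 | 89.5 | 67.0 | 83.7 | 93.2 |
| PCA + L2 | 64 | 87.5 | 62.4 | 78.9 | 89.7 |
| FCR1 + L2 | 64 | 87.7 | 61.3 | 78.6 | 90.1 |
| FCR2 + L2 | 64 | 87.9 | 62.5 | 79.8 | 90.8 |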
3.1 Dimensionality Reduction Layer
One of DML’s strengths when fine-tuning is that its output dimensionality can be chosen flexibly through the final fully connected layer. When features are taken from a mid-layer of a softmax classification network, on the other hand, their dimensionality is fixed. Some existing methods use PCA or discriminative dimensionality reduction to reduce the number of feature dimensions. In our experiment, we evaluated three methods for changing the feature dimensionality. Following conventional PCA approaches, we extracted features from the 1024-dimensional pool5 layer of GoogLeNet (Fig. (a)) and applied PCA to reduce the dimensionality. In a contrasting approach, we added a fully connected layer with the required number of neurons just before the output layer (FCR1, Fig. (b)). We also investigated a third approach in which the added fully connected layer is followed by a dropout layer (FCR2, Fig. (c)).
In this study, all the features extracted from the classification networks are from the last layer before the last output layer. The outputs are normalized by the softmax function and then evaluated by the cross-entropy loss function in the networks. Assume that the output vector is where . For arbitrary positive constant , returns the same vector after the softmax function is applied. The features we extract from the networks are given as , where denotes the linear projection matrix from the layer before the output layer to the output layer. The vector has an ambiguity in its scale, thus its linear transformed vector also has an ambiguity in the scale; it should therefore be normalized. As Fig. (b) clearly indicates, the distance between features extracted from a softmax-based classifier should be evaluated by cosine similarity, not by the Euclidean distance.
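The practical consequence of this scale ambiguity can be checked numerically. The sketch below, with hypothetical three-dimensional feature vectors and helper names of our own choosing, shows that cosine similarity is unchanged when a feature is rescaled by a positive constant, whereas the raw Euclidean distance is not. It also checks the standard identity that, for L2-normalized vectors, the squared Euclidean distance equals 2 - 2 cos, so ranking by Euclidean distance after L2 normalization is equivalent to ranking by cosine similarity.

```python
import math

def l2_normalize(v):
    """Scale a non-zero vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

f1 = [1.0, 2.0, 2.0]               # hypothetical feature vectors
f2 = [2.0, 1.0, 2.0]
f1_scaled = [5.0 * x for x in f1]  # same direction as f1, different scale

cos_before = cosine_similarity(f1, f2)
cos_after = cosine_similarity(f1_scaled, f2)   # unaffected by the rescaling
dist_before = euclidean(f1, f2)
dist_after = euclidean(f1_scaled, f2)          # changes with the rescaling
dist_unit = euclidean(l2_normalize(f1_scaled), l2_normalize(f2))
```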
In this section, we compare the deep features extracted from classification networks with those from state-of-the-art DML-based networks. The GoogLeNet architecture was used for all the methods; thus, the numbers of parameters are the same between DML-based networks and softmax-based features. All the networks were fine-tuned from the weights pretrained on ImageNet. We used the Caffe framework for the implementation.
4.1 Comparisons between softmax-based features and DML-based features
Here, we give our evaluation of clustering and retrieval scores for the state-of-the-art DML methods and for the softmax classification networks. We used the Caltech UCSD Birds 200-2011 (CUB) dataset, the Stanford Cars 196 (CAR) dataset, and the Stanford Online Products (OP) dataset. For CUB and CAR, we used the first half of the dataset classes for training and the rest for testing. For OP, we used the training–testing class split provided. The dataset properties are shown in Table I. We emphasize that the class sets used for training and testing were completely different.
For clustering evaluation, we applied k-means clustering 100 times and calculated NMI (Normalized Mutual Information); the value of k was set to the number of classes in the test set. For retrieval evaluation, we calculated Recall@K.
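The two evaluation measures can be sketched in a few lines of Python. The sketch assumes the definitions commonly used in DML evaluations: Recall@K is the fraction of queries whose K nearest neighbors (excluding the query itself) contain at least one sample of the same class, and NMI is the mutual information between cluster assignments and class labels divided by the mean of their entropies. The function names and the brute-force neighbor search are our own simplifications; the k-means step itself is not shown, and `nmi` takes cluster assignments that are assumed to have been computed already.

```python
import math
from collections import Counter

def recall_at_k(features, labels, k):
    """Fraction of queries with a same-class sample among the k nearest others."""
    hits = 0
    for i, query in enumerate(features):
        # Squared Euclidean distance to every other sample (query excluded).
        dists = sorted(
            (sum((a - b) ** 2 for a, b in zip(query, f)), j)
            for j, f in enumerate(features) if j != i
        )
        if any(labels[j] == labels[i] for _, j in dists[:k]):
            hits += 1
    return hits / len(features)

def entropy(assignments):
    n = len(assignments)
    return -sum((c / n) * math.log(c / n) for c in Counter(assignments).values())

def nmi(clusters, classes):
    """Mutual information normalized by the mean of the two entropies."""
    n = len(classes)
    joint = Counter(zip(clusters, classes))
    n_cluster, n_class = Counter(clusters), Counter(classes)
    mi = sum(
        (c / n) * math.log((c / n) / ((n_cluster[a] / n) * (n_class[b] / n)))
        for (a, b), c in joint.items()
    )
    denom = (entropy(clusters) + entropy(classes)) / 2
    return mi / denom if denom > 0 else 1.0  # degenerate case: one cluster, one class

# Hypothetical 2-D features: the last sample lies closer to the other class.
features = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [0.2, 0.1]]
labels = [0, 0, 1, 1, 1]
r_at_1 = recall_at_k(features, labels, k=1)
r_at_3 = recall_at_k(features, labels, k=3)
```

In this toy example, the last sample has no same-class neighbor among its two nearest points, so it counts as a miss at K = 1 and becomes a hit once K is large enough to reach its own class.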
In Table II and Table III, we compare clustering and retrieval performance, measured by NMI and Recall@K, respectively, on the CUB and CAR datasets. We compared the three softmax-based features with lifted structure, N-pair loss, and the clustering loss. The results of the DML methods were quoted from the paper. Regarding the lifted structure, the results in parentheses correspond to the scores we obtained by running the publicly available code ourselves, which we confirmed were almost the same as the published ones. As Table II and Table III show, the softmax-based features were competitive with or better than the DML features on both datasets; in particular, PCA + L2 achieved the highest Recall@K at every K.
In contrast to the CUB and CAR datasets, DML features outperform softmax-based features on the OP dataset, as shown in Table IV. A detailed analysis is given in the subsequent section.
4.2 Detailed comparisons between softmax-based features and lifted structure embedding features
We made detailed comparisons between softmax-based features and lifted structure embedding while varying the feature dimensionality and the dataset size. We conducted these experiments using the code available for lifted structure embedding.
Firstly, we show how the performance varies when changing the feature dimensionalities. We changed the dimensionalities of softmax-based features via PCA, FCR1 and FCR2, and investigated how the performance of clustering and retrieval varied. We compared them against those of lifted structure embedding of the same dimensionality.
For training, we multiplied the learning rates of the changed layers (output layers for all models and the fully connected layer added for FCR1 and FCR2) by 10. The batch size was set to 128, and the maximum number of iterations for our training was set to 20,000, which was large enough for the three datasets to converge as mentioned in . These training strategies were exactly the same as those used in .
We show the results for CUB and CAR datasets in Fig. 13 and in Fig. 13, respectively, under varying dimensionalities. The deep features extracted from the softmax-based classification networks outperformed the lifted structured feature embedding in clustering (NMI) and retrieval (Recall@K).
For clustering performance measured by NMI, all of the softmax models (PCA, FCR1, and FCR2) showed better scores than the lifted structured feature embedding. Regarding normalization, softmax-based features with L2 normalization showed better performance than those without normalization.
The NMI scores of PCA, FCR1 and FCR2 monotonically increased as the feature dimensionality increased for the CUB dataset (Fig. 13). On the other hand, in the CAR dataset (Fig. 13), the NMI scores of FCR2 and the lifted structure embeddings decreased from 256 dimensions, and those of PCA and FCR1 were saturated above 256 dimensions. This experimental result shows that 1024 dimensions is too large to represent the image classes of the CAR dataset. It also implies that the feature dimensionality should be carefully chosen for the target data in order to achieve the best performance.
For retrieval performance measured by the Recall@K metric, the softmax-based features also outperformed the features of lifted structured feature embedding. Regarding L2 normalization, features with normalization showed better scores than those without it.
Fig. 13 shows the clustering and retrieval performance, measured by NMI and Recall@K, respectively, for the Online Products dataset. In contrast to the CUB and CAR datasets, the softmax-based features with L2 normalization and the lifted structure embedding showed almost the same performance in clustering and retrieval. As shown in Table I, the OP dataset is very different from the CUB and CAR datasets in terms of the number of classes and the number of samples per class: the number of classes is 22k and the number of samples is 120k. The number of samples per class in the OP dataset is 5.3 on average, which is far smaller than in the CUB and CAR datasets.
4.3 The effect of the dataset scales
From the results for these three datasets, we conjecture that the dataset size (that is, the number of samples per class) has a considerable influence on softmax-based features. Hence, we changed the size of the datasets by sampling the images of the CUB and CAR datasets for each class and ran the experiments again. We constructed seven datasets of different sizes, containing 5, 10, 20, 40, 60, 80, and 100% of the whole dataset, respectively. Among them, 5% corresponds to approximately 3 and 4 images per class in the CUB and CAR datasets, respectively. As shown in Fig. 15 and Fig. 15, the differences between the scores for softmax and DML were small when the training dataset was small. The gap between softmax and DML became larger as the dataset size increased. The softmax-based classifier was strongly influenced by the size of the dataset.
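One way to build such reduced datasets is to sample a fixed fraction of the images within each class. The Python sketch below is only an assumption about how this could be done: the text does not specify the sampling procedure, so the uniform random sampling, the rounding rule, the minimum of one image per class, and the function name are all hypothetical.

```python
import random
from collections import defaultdict

def subsample_per_class(labels, fraction, seed=0):
    """Return sorted sample indices keeping about `fraction` of each class."""
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    kept = []
    for indices in by_class.values():
        # Assumption: round to the nearest integer, but keep at least one image.
        n_keep = max(1, round(fraction * len(indices)))
        kept.extend(rng.sample(indices, n_keep))
    return sorted(kept)

# Hypothetical dataset: 3 classes with 60 images each.
labels = [c for c in range(3) for _ in range(60)]
subset = subsample_per_class(labels, fraction=0.05)
per_class = [sum(1 for i in subset if labels[i] == c) for c in range(3)]
```

With 60 images per class, a 5% fraction keeps 3 images per class, which is the same order as the approximately 3 images per class stated above for the CUB dataset.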
Because there was no equitable comparison in previous studies, we conducted comparisons of the softmax-based features and the state-of-the-art DML features using a design that would enable these methods to objectively demonstrate their true performance capabilities. Our results showed that the features extracted from softmax-based classifiers performed better than those from state-of-the-art DML methods on fine-grained classification, clustering, and retrieval tasks when the size of the training dataset (samples per class) is large. The results also showed that the size of the dataset strongly influenced the performance of softmax-based features. When the size of the dataset was small, DML showed better or competitive performance. DML methods have advantages when the number of classes is very large and the softmax-based classifier is no longer applicable. In DML studies, softmax-based features have rarely been compared fairly with DML-based features under the same network architecture or with adequate fine-tuning. This paper revealed that the softmax-based features are still strong baselines. The results suggest that fine-tuned softmax-based features should be taken into account when evaluating the performance of deep features.
- [1] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, T. Darrell, DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition, ICML, pp. 647–655, 2014
- [2] A. S. Razavian, H. Azizpour, J. Sullivan, S. Carlsson, CNN Features Off-the-Shelf: An Astounding Baseline for Recognition, CVPR Workshops, pp. 512–519, 2014
- [3] Y. Liu, Y. Guo, S. Wu, M. S. Lew, DeepIndex for Accurate and Efficient Image Retrieval, ICMR, pp. 43–50, 2015
- [4] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, L. Fei-Fei, ImageNet Large Scale Visual Recognition Challenge, IJCV, Vol. 115, No. 3, pp. 211–252, 2015
- [5] J. Wan, D. Wang, S. H. Hoi, P. Wu, J. Zhu, Y. Zhang, J. Li, Deep Learning for Content-Based Image Retrieval: A Comprehensive Study, ACM Multimedia, pp. 157–166, 2014
- [6] A. Babenko, A. Slesarev, A. Chigorin, V. Lempitsky, Neural Codes for Image Retrieval, ECCV, pp. 584–599, 2014
- [7] S. Bell, K. Bala, Learning Visual Similarity for Product Design with Convolutional Neural Networks, SIGGRAPH, 2015
- [8] F. Schroff, D. Kalenichenko, J. Philbin, FaceNet: A Unified Embedding for Face Recognition and Clustering, CVPR, pp. 815–823, 2015
- [9] H. O. Song, Y. Xiang, S. Jegelka, S. Savarese, Deep Metric Learning via Lifted Structured Feature Embedding, CVPR, pp. 4004–4012, 2016
- [10] K. Sohn, Improved Deep Metric Learning with Multi-class N-pair Loss Objective, NIPS, pp. 1857–1865, 2016
- [11] H. O. Song, S. Jegelka, V. Rathod, K. Murphy, Deep Metric Learning via Facility Location, CVPR, pp. 2206–2214, 2017
- [12] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE, Vol. 86, No. 11, pp. 2278–2324, 1998
- [13] R. Hadsell, S. Chopra, Y. LeCun, Dimensionality reduction by learning an invariant mapping, CVPR, pp. 1735–1742, 2006
- [14] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich, Going Deeper with Convolutions, CVPR, pp. 1–9, 2015
- [15] A. Krizhevsky, I. Sutskever, G. E. Hinton, ImageNet Classification with Deep Convolutional Neural Networks, NIPS, pp. 1097–1105, 2012
- [16] K. Simonyan, A. Zisserman, Very Deep Convolutional Networks for Large-Scale Image Recognition, ICLR, 2015
- [17] K. He, X. Zhang, S. Ren, J. Sun, Deep Residual Learning for Image Recognition, CVPR, pp. 770–778, 2016
- [18] Q. Qian, R. Jing, S. Zhu, Y. Lin, Fine-grained visual categorization via multi-stage metric learning, CVPR, pp. 3716–3724, 2015
- [19] J. Bromley, I. Guyon, Y. LeCun, E. Säckinger, R. Shah, Signature Verification using a "Siamese" Time Delay Neural Network, NIPS, pp. 737–744, 1994
- [20] S. Chopra, R. Hadsell, Y. LeCun, Learning a similarity metric discriminatively, with application to face verification, CVPR, pp. 539–546, 2005
- [21] G. Chechik, V. Sharma, U. Shalit, S. Bengio, Large scale online learning of image similarity through ranking, JMLR, pp. 1109–1135, 2010
- [22] Y. Taigman, M. Yang, M. Ranzato, L. Wolf, DeepFace: Closing the Gap to Human-Level Performance in Face Verification, CVPR, pp. 1701–1708, 2014
- [23] C. Wah, S. Branson, P. Welinder, P. Perona, S. Belongie, The Caltech-UCSD Birds-200-2011 Dataset, California Institute of Technology, CNS-TR-2011-001, 2011
- [24] J. Krause, M. Stark, J. Deng, L. Fei-Fei, 3D Object Representations for Fine-Grained Categorization, 4th International IEEE Workshop on 3D Representation and Recognition, pp. 554–561, 2013
- [25] L. Wei, Q. Huang, D. Ceylan, E. Vouga, H. Li, Dense Human Body Correspondences Using Convolutional Networks, CVPR, pp. 1544–1553, 2016
- [26] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, T. Darrell, Caffe: Convolutional Architecture for Fast Feature Embedding, arXiv preprint arXiv:1408.5093, 2014
- [27] C. D. Manning, P. Raghavan, H. Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008
- [28] H. Jegou, M. Douze, C. Schmid, Product Quantization for Nearest Neighbor Search, IEEE Trans. PAMI, Vol. 33, No. 1, pp. 117–128, 2011