1 Introduction
With the progress of deep learning
[1, 2, 3], deep embedding learning has received considerable attention and has been applied to a wide range of tasks, including image retrieval and clustering [4, 5, 6, 7], pattern verification [8, 9, 10, 11] and domain adaptation [12, 13]. Deep embedding learning aims to learn a feature representation of the input image that keeps similar data points close and dissimilar data points far apart in the feature space. In the deep embedding learning community, the most notable works are based on contrastive loss [8, 14, 15, 11] and triplet loss [9, 5, 7, 10]. It is common knowledge that hard example mining is crucial to the quality and efficiency of these methods, since overly easy examples satisfy the constraint well and produce nearly zero loss, contributing nothing to the parameter update during backpropagation. Nevertheless, many hard example mining methods incur a large computational cost when measuring the embedding vectors in feature space, and they are performance-sensitive, e.g. the hard-class mining procedure in N-pair loss [16].
To alleviate the issue above, and to learn compact intra-class and separable inter-class distances, we introduce the concept of a large margin constraint into N-pair loss instead of hard-class mining. Some existing works [20, 21] have focused on learning discriminative embeddings by injecting large margin constraints into KNN and softmax, respectively. However, they exert a non-adaptive constraint on the objective loss by introducing a fixed margin, which is not suitable for heterogeneous and multimodal feature distributions.
Figure 1 illustrates the comparison between the feature distributions on the fine-grained bird dataset [18] and the MNIST dataset [19]. It can be clearly observed that the diversity of the embedding representation on the bird dataset is prominent, where the intra-class distance can be larger than the inter-class distance and the distribution is heterogeneous, unlike the 'uniform' distribution on MNIST. In real cases, the distribution of the feature space is complex due to pose and appearance [22]. Thus, a consequent problem is that a stronger margin constraint can be used for easy patterns while it is infeasible for hard patterns (easy/hard patterns refer to cases where the intra-class distance is smaller/larger than the inter-class distance). That is why coarsely imposing a fixed constraint not only fails to improve performance, but may even lead to the failure of training. Thus, introducing a prudent and local-adaptive margin constraint is of the essence.
In this paper, we propose Adaptive Large Margin N-pair loss (ALMN) to address the aforementioned issues, producing discriminative embeddings under the heterogeneous feature distributions of multimodal cases. This is mainly achieved by introducing an adaptive margin constraint in terms of the local embedding representation structure. As an extension to N-pair loss [16], our method optimizes the angular distance between samples as well, which is rotation-invariant and scale-invariant by nature. Furthermore, the adaptive large margin constraint is tactfully constructed by a novel technique of Virtual Point Generating (VPG), which factitiously maps a well-learned positive data point to a farther place. Then, by optimizing this virtually generated new point well, a large angular margin can be obtained. Moreover, the strength of the margin constraint induced by VPG for an individual pattern is adjustable, quantified by a hyperparameter λ: with bigger λ, the ideal margin between samples becomes larger. ALMN is a flexible learning objective, can easily be used as a drop-in loss function in end-to-end frameworks, and can be combined with any other hard example mining strategy. To the best of our knowledge, this is the first work to introduce a margin constraint by generating virtual data points for deep embedding learning. Virtual point generating itself remains an open question; in this work, we simply consider a geometrical way. Image retrieval and clustering experiments have been performed on several datasets, including CUB-200-2011 [18], CARS196 [23], Flowers102 [24], Aircraft [25] and Stanford Online Products [5].

2 Related Work
The key goal of deep embedding learning is to learn a feature representation that keeps the distance between related data points small and unrelated data points large in the feature space. Some works jointly optimize contrastive loss and softmax loss for discriminative feature learning, such as DeepID2 [8] and DeepID2+ [26]. FaceNet [9] proposes triplet loss to improve deep embedding learning without jointly training with softmax loss, and many notable works use triplet-based objective losses to optimize deep frameworks across tasks [10, 27, 7, 5]. Lifted structure embedding [5] encourages each positive pair to compare its distance against all the negative pairs in one mini-batch, aiming to make full use of the mini-batch; to avoid convergence at a bad local optimum, it optimizes a smooth upper bound of the nested max functions. Local Similarity-Aware [22] generalizes triplet loss to a quadruplet-like loss and selects hard samples by PDDM units. N-pair loss [16] expands the idea of a triplet or quadruplet tuple to an N-pair tuple, and enforces a softmax cross-entropy loss among the pairwise similarity values in the batch. We share the same core idea with N-pair loss of taking all negative samples in the current mini-batch into consideration, but as an extension, our ALMN can yield more discriminative embeddings even without hard-sample mining, as a consequence of adaptive large margin learning.
The performance of most of the aforementioned works is sensitive to the selected example pairs. Selecting genuinely hard samples to construct a training batch can significantly improve the quality of learning, but it also incurs a large computational cost. In contrast, our ALMN does not require hard-class mining (adopted in the original N-pair loss), and thus allows the training of discriminative embeddings at a lower computational cost.
There are other works aiming at learning discriminative embedding features. Large Margin Nearest Neighbor (LMNN) [20] optimizes the Mahalanobis metric for nearest neighbor classification. Recently, Large Margin Softmax (L-Softmax) [21] encourages an angular decision margin between classes. However, it is designed for Softmax and its margin constraint is the same for all patterns, e.g. a double-angle constraint for both easy and hard patterns; it is thus possibly unsuitable for a multimodal feature space, and the convergence of the model is slow. Our ALMN allows a local-adaptive margin constraint and can be successfully applied in multimodal cases.
Some other works have emerged in the deep embedding learning community. Clustering [28] formulates NMI as the objective function and optimizes it in deep models. HDC [29] employs cascaded models and selects hard samples from different levels and models. Smart mining [30] combines a local triplet loss and a global loss to optimize the deep metric with hard-sample mining. Sampling Matters [31] proposes a distance-weighted sampling strategy and uses a much stronger deep model (ResNet-50) than most existing methods. Angular loss [32] optimizes a triangle-based angular function. BIER loss [33] adopts an ensemble learning framework of online gradient boosting, which is totally different from our method belonging to the single-feature learning family. Proxy-NCA [34] explains why the popular classification loss works from a proxy-agent view, and its implementation is very similar to Softmax. In summary, different from the above methods that investigate ways of mining informative samples or feature ensembling, we mainly focus on introducing an open question, i.e. VPG, to impose a large margin constraint so as to improve the discrimination of deep embedding learning.

3 Adaptive Large Margin N-pair Loss
In deep embedding learning, our goal is to learn a deep feature embedding f(·) from an input image into a feature vector, such that the similarity between x_i and x_j is higher when they belong to the same class and lower when they belong to different classes, where x_i refers to the feature vector of image i. To ensure intra-class compactness and inter-class separability, we introduce a large margin constraint instead of exploring sample-mining strategies. One related work, L-Softmax [21], uses a preset and fixed angular margin constraint to enlarge the margin between classes. However, in practical vision tasks the embedding distribution always exhibits a multimodal character due to pose and appearance [22], and therefore a fixed margin constraint is not suitable. Specifically, a relatively weak constraint contributes little to the optimization of easy patterns, while a rigorous constraint might be too strong to guide the training of hard patterns. Under a multimodal situation, learning discriminative feature embeddings by injecting an applicable margin constraint is the appropriate remedy. Therefore, we propose Adaptive Large Margin N-pair loss (ALMN), which can meet the needs of multimodal feature distributions. Below, we first review N-pair loss, then introduce our basic objective function, and finally present the mainstay of ALMN, i.e. Virtual Point Generating.

3.1 Review of N-pair Loss and Preliminaries
N-pair loss [16] points out that simultaneously optimizing with multiple negative samples can be regarded as an approximation of 'global optimization' and can thus improve performance. It is formulated as follows:
(1)  L_{N-pair} = (1/N) Σ_{i=1}^{N} log(1 + Σ_{j≠i} exp(f_i^⊤ f_j^+ − f_i^⊤ f_i^+)) + (μ/N) Σ_{i=1}^{N} (‖f_i‖² + ‖f_i^+‖²)
where μ is a regularization constant for the L2 norm and N is the mini-batch size; f_i^+, f_i and {f_j^+}_{j≠i} refer to the positive point, anchor point and negative points, respectively. Moreover, when minimizing Eq. 1, the optimization of the inner-product-based softmax-like function implicitly optimizes the angle between samples, since the inner-product similarity can be rewritten as f_i^⊤ f_j^+ = ‖f_i‖‖f_j^+‖ cos θ_{ij}, and in order to correctly separate f_i^+ from f_j^+, N-pair loss forces f_i^⊤ f_i^+ > f_i^⊤ f_j^+, i.e. ‖f_i‖‖f_i^+‖ cos θ_{ii} > ‖f_i‖‖f_j^+‖ cos θ_{ij}, where θ_{ij} is the angle between f_i and f_j^+; this optimization is mainly determined by the cosine term, as verified by L-Softmax [21].
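As a concrete illustration, the N-pair objective above can be sketched in a few lines of plain Python. The function name, toy vectors and regularization weight are illustrative assumptions, not the paper's implementation.

```python
import math

def npair_loss(anchor, positive, negatives, reg=0.002):
    # Sketch of the N-pair objective for one tuple:
    # log(1 + sum_j exp(anchor.negative_j - anchor.positive)),
    # plus an L2 penalty keeping the embedding norms bounded.
    dot = lambda u, v: sum(a * b for a, b in zip(u, v))
    pos_sim = dot(anchor, positive)
    logits = [dot(anchor, n) - pos_sim for n in negatives]
    loss = math.log(1.0 + sum(math.exp(z) for z in logits))
    l2 = reg * (dot(anchor, anchor) + dot(positive, positive))
    return loss + l2
```

An orthogonal (easy) negative yields a smaller loss than a negative collinear with the anchor (hard), matching the hard-example discussion above.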
3.2 Basic Objective Function based on Centers
From minimizing Eq. 1, it can be observed that we would like to force the anchor-positive similarity to exceed every anchor-negative similarity in order to correctly separate the positive from the negatives; in other words, we intend to push the positive close to the anchor and pull the negatives far from it. Apparently, a reasonable location of the anchor point determines the stability of model training, since the anchor point affects the gradient direction, and an unstable direction impedes training stability. To this end, we adopt the class center c_{y_i} instead of a random positive sample as our anchor point. However, it is impossible to update the class centers with respect to the entire training set during each iteration. We share a similar idea with [35], which performs the update on the basis of mini-batches. At each iteration, the class centers are updated as follows:
(2)  c_j ← c_j − α · [ Σ_{i=1}^{N} δ(y_i = j)(c_j − x_i) ] / [ 1 + Σ_{i=1}^{N} δ(y_i = j) ]
where δ(condition) = 1 if the condition is satisfied and 0 otherwise, and α is the learning rate of the centers. Finally, our basic objective loss is as follows:
(3)  L_{basic} = (1/N) Σ_{i=1}^{N} log(1 + Σ_{j: y_j ≠ y_i} exp(c_{y_i}^⊤ x_j − c_{y_i}^⊤ x_i))
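The mini-batch center-update rule can be sketched as follows, in the spirit of the rule the paper borrows from [35]. The dictionary-based storage and the value of `alpha` are illustrative assumptions.

```python
def update_centers(centers, feats, labels, alpha=0.5):
    # For each class center c_j, apply
    #   c_j <- c_j - alpha * sum_i 1[y_i = j] (c_j - x_i) / (1 + count_j),
    # using only the samples present in the current mini-batch.
    for j, c in centers.items():
        members = [x for x, y in zip(feats, labels) if y == j]
        if not members:
            continue  # no sample of class j in this batch
        dim = len(c)
        delta = [sum(c[d] - x[d] for x in members) / (1 + len(members))
                 for d in range(dim)]
        centers[j] = [c[d] - alpha * delta[d] for d in range(dim)]
    return centers
```

With one member at [2, 0] and a center at the origin, a full step (alpha = 1) moves the center halfway toward the sample, as the denominator 1 + count damps the update.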
3.3 Virtual Point Generating
However, without hard-class mining, the constraint c_{y_i}^⊤ x_i > c_{y_i}^⊤ x_j (for simplicity, we consider the binary-class problem here; multi-class classification complicates our analysis but has the same mechanism as the binary scenario) can hardly satisfy our demands for discriminative embedding learning, since it is easily satisfied and then stops contributing to the parameter update, as shown in Fig. 2.(a), where the decision boundaries for the two classes overlap, yielding separable but not discriminative features. Inspired by L-Softmax [21], in which optimizing a more rigorous objective produces more rigorous decision boundaries and a larger decision margin, we propose Virtual Point Generating (VPG) to enhance the constraint by generating a virtual local-hard point x′; the constraint based on this generated point is more suitable in a multimodal space than L-Softmax, producing an adaptive decision margin. Here, we first introduce the general concept of VPG and then explain how to make it adaptive. Since the training of Eq. 3 is based on angular optimization, x′ is generated in an angular manner, and to keep the same amplitude as x^+, we formulate x′ as follows:
(4)  x′ = ‖x^+‖ · (x^+ + λv) / ‖x^+ + λv‖
As shown in Fig. 2, the vector v affects the location of x′; here we do not focus on its specific value, which will be investigated later. λ is a hyperparameter that further controls the location of x′. From the right chart of Figure 2, it can be observed that the newly generated data point x′ has a larger angular distance to the anchor point than x^+. Therefore, to make a more rigorous decision boundary, we instead require
(5)  c^⊤ x′ > c^⊤ x^−
where c is the anchor center and x^− denotes any negative point. Due to the geometrical relationship in Figure 2, c^⊤ x^+ ≥ c^⊤ x′ always holds; if we can optimize c^⊤ x′ > c^⊤ x^−, then c^⊤ x^+ > c^⊤ x^− will spontaneously hold. So the new objective (i.e. Eq. 5) is a stronger constraint (requirement) for correctly separating x^+ from x^−, producing more rigorous decision boundaries.
As illustrated in the left chart of Figure 2, optimizing this objective, which implicitly carries a stronger margin constraint, produces a large angular decision margin between classes and encourages both intra-class compactness and inter-class separability. Specifically, as in Fig. 2.(a), before VPG, when the training loss reaches a stable level, the data points in feature space have no need to move further because they already satisfy the constraint well. After VPG, however, as shown in Fig. 2.(b), x^+ is mapped to a boundary point or an even harder point x′ in feature space; in order to correctly separate x′ from x^−, a new decision boundary is produced, and it further pushes x′ as well as x^+ towards the anchor and away from x^− in an angular manner, yielding more compact intra-class and more separable inter-class angular distributions. Moreover, as can naturally be inferred from Figure 2, by increasing λ to a larger value, a farther x′ is generated; in other words, a more rigorous objective is optimized, and thus in the ideal case a more discriminative embedding can be achieved.
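To build intuition for VPG, the following toy 2-D sketch maps a positive point to a virtual one whose angle to the anchor is enlarged while its amplitude is preserved. The simple angle-scaling rule used here is only an illustration of the angular-margin effect and differs from the paper's actual construction in Eqs. 4 and 7.

```python
import math

def virtual_point_2d(anchor, positive, scale=2.0):
    # Enlarge the positive's angle to the anchor by `scale`
    # while keeping its amplitude (norm) unchanged.
    ang = lambda v: math.atan2(v[1], v[0])
    theta = ang(positive) - ang(anchor)   # angle to the anchor
    new_ang = ang(anchor) + scale * theta # enlarged angle
    r = math.hypot(positive[0], positive[1])  # same amplitude
    return [r * math.cos(new_ang), r * math.sin(new_ang)]
```

A positive at 90 degrees from the anchor is mapped to 180 degrees with scale 2, i.e. a strictly harder point for the same anchor, which is exactly the pressure VPG applies.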
Adaptive Margin: Without loss of generality, we consider a fixed λ. As mentioned above, our goal is to make the large margin constraint adaptive, and from Eq. 4 one can observe that x′ is mainly determined by the vector v. Hence, v should be local-adaptive, such that the margin constraint based on x′ is applicable to each case, e.g. hard and easy patterns. Specifically, considering the local feature space, v should be constructed from the local geometry (as in Figure 2), depending on the angle between x^+ and its nearest negative vector as well as the angle between x^+ and the anchor. In summary, since x′ is based on v, which takes the local feature structure into account, the margin constraint introduced by x′ is adaptive; in other words, easy patterns (with a larger angle to the nearest negative and a smaller angle to the anchor, i.e. more angular room) are equipped with a relatively stronger constraint, and hard patterns (with a smaller angle to the nearest negative and a larger angle to the anchor) with a weaker one. As a consequence, the margin constraint is adaptive.
To generate x′, we need to compute the specific value of v. However, its exact value is not our focus and does not matter much. In practice, we adopt random sampling instead of hard negative mining, and only one mini-batch is fed into the network at each iteration, so the nearest negative within one mini-batch is not globally optimal and is always much farther away, resulting in a larger angle and hence a larger v; in other words, a farther x′ and a non-local margin constraint would be introduced. As a consequence, training would become hard and might even fail. We address this challenge by empirically and experimentally constructing a lower-bound vector v̂ of v (a lower-bound vector has the same direction as the original vector, yet a smaller amplitude), as follows:
(6) 
Proposition 1
v̂ is a lower-bound vector of v, as illustrated in Figure 3.
Proof
We provide an explicit geometric interpretation of this lower-bound vector v̂. As shown in Figure 3.(a), by the Law of Cosines in the corresponding triangle, the amplitude of v can be expressed through the angles involved; additionally, the relevant points lie on one concentric circle, and by the Law of Sines it is easy to prove the required inequality between the two amplitudes. Hence the inequality always holds, and from Eq. 6 it follows that v̂ and v share the same direction while the amplitude of v̂ is no larger than that of v. In conclusion, v̂ in Eq. 6 can be regarded as a lower-bound vector of v.
Replacing v in Eq. 4 with the lower-bound vector v̂, we obtain a more stable x′, as depicted in Figure 3.(b), and formulate it as follows:
(7)  x′ = ‖x^+‖ · (x^+ + λv̂) / ‖x^+ + λv̂‖
where v̂ is given by Eq. 6. This addresses, to some extent, the problem of a less-than-ideal angular constraint caused by random sample mining and mini-batch training. We experimentally find that it indeed works well and also keeps the network optimization stable.
Overall Objective: to optimize the new rigorous objective of Eq. 5, we follow N-pair loss and formulate our ALMN loss as follows:
(8)  L_{ALMN} = (1/N) Σ_{i=1}^{N} log(1 + Σ_{j: y_j ≠ y_i} exp(c_{y_i}^⊤ x_j − c_{y_i}^⊤ x′_i))
where x′_i is generated from x_i as in Eq. 7. Obviously, when λ = 0, x′_i = x_i, and we take this setting as our baseline. ALMN can be easily optimized by the commonly used SGD and backpropagation algorithms. The gradients with respect to the embeddings and the centers are listed as follows:
(9)  
(10) 
(11) 
Finally, we summarize ALMN in Algorithm 1. Most worthy of mention is that we introduce the novel concept of VPG to enhance the margin constraint, i.e. generating a virtual boundary point x′ and optimizing it instead of the original x^+. While our VPG does not limit the specific formulation of x′, we leave it as an open question, and there can be other ways to generate x′; here, for geometrical interpretation, we simply take Eqs. 4 and 7.
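Before wiring hand-derived gradients such as those above into backpropagation, a central-difference check is a standard sanity test. This generic utility is our own addition for illustration, not part of the paper's algorithm.

```python
def numeric_grad(f, x, eps=1e-6):
    # Central-difference estimate of the gradient of scalar
    # function f at point x; compare against analytic gradients
    # (e.g. Eqs. 9-11) before trusting a backprop implementation.
    g = []
    for d in range(len(x)):
        xp, xm = list(x), list(x)
        xp[d] += eps
        xm[d] -= eps
        g.append((f(xp) - f(xm)) / (2 * eps))
    return g
```

For f(x) = x0^2 + 3*x1 at (1, 2), the check recovers the analytic gradient (2, 3) up to floating-point error.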
4 Discussion
The ALMN loss encourages an adaptive large angular margin among classes by a novel constraint constructing method VPG. It has some nice properties:

The core of VPG is to enhance the margin constraint by generating virtually hard points. The holistic margin constraint can be controlled by the hyperparameter λ: with bigger λ, the ideal margin between classes becomes larger, yielding more discriminative embeddings.

For any fixed λ, the angular margin constraint induced by VPG is local-adaptive and varies across instances, since the virtual point is generated on the basis of the local feature structure. Thus, easy patterns are supervised by a stronger constraint, while hard patterns are optimized under a relatively weaker one.

Our VPG is a generic method that can easily be combined with any other hard-sample-mining method and model architecture.
Comparison to N-pair loss: as an extension to N-pair loss [16], our ALMN has two advantages. First, by employing class centers as anchor points instead of random positive points, the optimization of ALMN is more stable and ideal than that of N-pair loss due to the correct direction of the gradients, and thus the performance of deep embedding learning is improved, as verified by the comparison between the baseline ALMN (without VPG) and N-pair loss in Tables 2 and 3. Second, and this is our main contribution, ALMN with the VPG constraint can significantly encourage a large angular decision margin among classes, yielding more discriminative feature embeddings than N-pair loss; this is mainly achieved by the novel and generic VPG method. Furthermore, our ALMN does not require the hard-class mining procedure that N-pair loss adopts to construct its training batches.
Comparison to other constraint losses: Noisy Softmax [36] imposes annealed noise on Softmax, aiming to improve the generalization ability of DCNNs. Our ALMN shares a similar goal with [21, 37] of enhancing the discriminative property of the learned features by exerting a constraint on the objective function. However, in [21], the constraint is specifically designed for the Softmax layer, and the strength of the margin constraint behind the optimization objective is the same for every sample (e.g. m = 2); this fixed m-times-angle constraint is not applicable under heterogeneous feature distributions. In contrast, our ALMN is designed for deep embedding learning; for a given λ, our margin constraint is local-adaptive, since the virtual point is generated on the basis of its neighbouring feature space rather than at a fixed scale. Moreover, the margin constraint of ALMN is introduced by the generated virtual point, which is different from directly setting m as in [21].
Ablation study: to highlight the effectiveness of our local-adaptive large margin constraint, we conduct a contrast test by modifying our basic objective function (Eq. 3) into an L-Softmax-like loss with a fixed angular margin constraint, as follows:
(12) 
where m is the same as in L-Softmax. Then, we train the same CNN model with Eq. 12 and Eq. 8, respectively. From Figure 7, we can observe that the training loss of the L-Softmax-like objective stops decreasing at a relatively high level, implying that it does not converge; we infer that the double-angle constraint may be too strong for some examples (e.g. hard patterns), and that this disturbs the overall training process. In contrast, the loss of our ALMN drops quickly to a relatively low level, demonstrating that the local-adaptive angular margin constraint can be well optimized and is thus indeed crucial for discriminative embedding learning in multimodal cases.
5 Experiments and Results
To demonstrate the effectiveness of our proposed ALMN under multimodal scenarios, we evaluate it on image clustering and retrieval tasks over several benchmark datasets, which exhibit a variety of variations in pose and appearance. Notably, apart from class labels we do not use any other annotation information such as bounding boxes or part annotations.
5.1 Implementation Details
For the network configuration, we use the ImageNet pretrained GoogLeNet [3] for initialization and fine-tune it on our target datasets. The last fully connected layer is initialized with random weights, and we fix the embedding feature size throughout all of our experiments (since the performance does not change much when varying the embedding size, according to [5]). A dropout layer is also used. For a fair comparison, we follow the same data preprocessing as [5]: all training and testing images are resized to a fixed resolution and mean subtraction is performed. For data augmentation, all training images are randomly cropped and randomly mirrored. All of our experiments are implemented with the Caffe library [38] with our own modifications.
As mentioned above, we do not perform hard-class mining. Instead, we construct a random batch in an N1 × N2 manner, where N1 and N2 denote the number of classes and the number of samples per class, respectively. Note that the classes and samples are all randomly selected. We investigate the effects of different combinations of N1 and N2 in the following subsection.
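The random batch construction described above can be sketched as follows; the data-structure and parameter names (`labels_by_class`, `num_classes`, `per_class`) are illustrative assumptions.

```python
import random

def sample_batch(labels_by_class, num_classes, per_class):
    # Random N1 x N2 batch: pick `num_classes` classes and
    # `per_class` image indices from each, all uniformly at
    # random -- no embedding evaluation, hence no mining cost.
    classes = random.sample(list(labels_by_class), num_classes)
    batch = []
    for c in classes:
        picks = random.sample(labels_by_class[c], per_class)
        batch += [(c, i) for i in picks]
    return batch
```

Because no forward pass is needed to choose the batch, this sampler adds no overhead compared with the hard-class mining of the original N-pair loss.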
Training: the initial learning rate is decayed once at a fixed iteration, and the total numbers of training iterations differ between (CUB, Flowers, Aircraft) and CARS196. We use weight decay and momentum, keep the regularization constant for the L2 norm fixed, and use 10 times the base learning rate for the feature layer.
Evaluation: as in many other works [5, 16, 22], we use the F1 and NMI metrics for the image clustering task and the Recall@K metric for the image retrieval task. We use the simple cosine distance for evaluating the embedding features. We take ALMN without VPG (i.e. λ = 0) as our baseline. For comparison, we evaluate many existing methods, and implement some of them with the same network and training configurations as ours, including triplet loss [9], lifted structured embedding [5] and N-pair loss [16].
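The Recall@K metric under cosine similarity can be computed as below; this is a standard definition (a query scores 1 if any of its K nearest neighbours, excluding itself, shares its label), not the paper's exact evaluation code.

```python
import math

def recall_at_k(feats, labels, k):
    # Fraction of queries whose k nearest neighbours (by cosine
    # similarity, self excluded) contain at least one same-label item.
    def cos(u, v):
        num = sum(a * b for a, b in zip(u, v))
        return num / (math.hypot(*u) * math.hypot(*v))
    hits = 0
    for i, (f, y) in enumerate(zip(feats, labels)):
        sims = sorted(((cos(f, g), labels[j])
                       for j, g in enumerate(feats) if j != i),
                      reverse=True)
        hits += any(lbl == y for _, lbl in sims[:k])
    return hits / len(feats)
```

On a toy set with two well-separated classes, Recall@1 is 1.0, since every query's nearest neighbour shares its label.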
5.2 Component Analysis
Mini-batch combination: to acquire a stable location of the anchor point, we employ the class center c. However, we experimentally found that the composition of the mini-batch is important to the update of c.
Inspired by N-pair loss, we construct an N1 × N2 mini-batch, where N1 and N2 denote the number of classes and the number of samples per class, respectively. Throughout our experiments the total batch size is fixed, and one can imagine that, as N2 increases, more and more positive samples contribute to the update of c at the same time, resulting in a more stable and more realistic class center. However, when N2 is large enough and N1 = 1, i.e. in the negative-sample-free limit, there is no contribution from negative samples and thus inter-class separability is no longer guaranteed.
CUB2002011  Cars196  
65 x 2  26 x 5  16 x 8  8 x 16  65 x 2  26 x 5  16 x 8  8 x 16  
Recall@K=1  51.1  52.4  52.1  51.1  64.2  71.6  69.7  68.8 
Recall@K=2  63.7  64.8  64.4  64.0  75.2  81.3  80.5  79.1 
Recall@K=4  74.5  75.4  75.6  74.6  83.7  88.2  88.3  86.4 
Recall@K=8  83.6  84.3  84.3  84.0  90.0  93.4  92.8  92.3 
F1  27.2  28.5  27.5  28  24.6  29.4  26.9  25.3 
NMI  59.7  60.7  59.6  60.3  57.9  62.0  60.9  58.6 
We evaluate the performance of the ALMN loss with different combinations of N1 × N2 on CUB-200-2011 [18] and CARS196 [23]; the experimental results are listed in Table 1. From the results, one can observe that the performance differs across combinations of N1 × N2, even though the total batch sizes are almost the same. As analyzed above, a relatively appropriate combination of N1 × N2 is required, which is important for stable training and discriminative embedding learning; we use the best-performing combination in the following subsections. Notably, although we need to construct the mini-batch according to this protocol, the selection is totally random and incurs no extra computational cost, since there is no need to evaluate the embedding vectors within the deep learning framework, which is different from the hard-class mining in N-pair loss.
Enlarging the angular margin: we can further enhance the angular margin constraint by increasing the parameter λ, such that a larger decision margin among classes is produced and a more discriminative embedding can be achieved. From Table 2, with λ = 0, in the zero-constraint limit, our baseline algorithm obtains relatively lower results. One can then observe that with VPG enabled our ALMN significantly improves the R@1 accuracies on the CUB and CARS datasets, verifying the effectiveness of the adaptive large margin constraint induced by VPG. The performance on all datasets further improves as λ increases, supporting our initial thought that a larger decision margin among classes encourages the learning of discriminative embeddings. Likewise, improvements can also be found on the other datasets, as shown in Tables 3 and 4.
5.3 Comparison with Stateoftheart
CUB2002011  Cars196  
R@1  R@2  R@4  R@8  F1  NMI  R@1  R@2  R@4  R@8  F1  NMI  
Google[3]  40.8  53.8  67.0  78.2  18.0  51.5  35.5  47.5  58.9  71.5  8.6  37.1 
Triplet[9]  36.1  48.6  59.3  70.0  15.1  49.8  39.1  50.4  63.3  74.5  16.8  51.4 
Lifted[5]  47.2  58.9  70.2  80.2  21.2  55.6  49.0  60.3  72.1  81.5  21.8  55.0 
Clustering[28]  48.2  61.4  71.8  81.9    59.2  58.1  70.6  80.3  87.8    59.4 
Smining[30]  49.8  62.3  74.1  83.3    59.9  64.7  76.2  84.2  90.2    59.5 
Angular[32]  53.6  65.0  75.3  83.7  30.2  61.0  71.3  80.7  87.0  91.8  31.8  62.4 
Npair[16]  49.1  61.2  72.7  82.1  25.9  58.5  63.6  74.7  84.1  90.1  23.9  57.4 
Proxy NCA[34]  49.2  61.9  67.9  72.4    59.5  73.2  82.4  86.4  88.7    64.9 
ALMN ()  50.4  62.7  73.5  82.9  27.6  59.4  66.2  76.7  85.1  91.4  23.6  56.7 
ALMN ()  52.0  64.5  74.8  83.7  28.2  60.2  70.4  80.4  87.3  92.5  26.3  59.3 
ALMN ()  52.2  64.7  75.3  84.2  28.2  60.7  71.3  81.2  88.1  93.1  28.3  61.5 
ALMN ()  52.4  64.8  75.4  84.3  28.5  60.7  71.6  81.3  88.2  93.4  29.4  62.0 
The CUB-200-2011 dataset [18] includes 11,788 bird images from 200 classes. We use the first 100 classes for training (5,864 images) and the remaining 100 classes for testing (5,924 images). We list our experimental results together with those of other state-of-the-art methods in Table 2. From the results, one can observe that our baseline ALMN outperforms N-pair loss even without the large margin constraint, demonstrating that a reasonable location of the anchor point not only makes training stable but also improves performance. Furthermore, by introducing an adaptive large angular margin constraint among classes, ALMN with VPG significantly improves the performance, outperforming most existing methods and achieving results comparable to the state-of-the-art, thus verifying the effectiveness of our adaptive large margin constraint.
The CARS196 dataset [23] includes 16,185 car images from 196 classes. We use the first 98 classes for training (8,054 images) and the remaining 98 classes for testing (8,131 images). We list our experimental results together with those of other state-of-the-art methods in Table 2. From the results, it can be observed that the baseline ALMN performs better than N-pair loss, demonstrating the superiority of our choice of the anchor point. ALMN with VPG then significantly improves the R@1 result over the baseline, outperforms most of the other existing methods, and obtains results comparable to the state-of-the-art, verifying the effectiveness of our method.
Flowers102  Aircraft  
R@1  R@2  R@4  R@8  F1  NMI  R@1  R@2  R@4  R@8  F1  NMI  
Googlenet[3]  80.5  87.6  92.9  95.7  41.0  63.8  42.0  52.8  64.2  75.6  10.3  30.0 
Triplet[9]  80.3  87.2  92.0  95.7  41.3  64.0  41.8  53.5  64.4  75.3  10.7  31.3 
Lifted[5]  82.6  89.4  93.1  96.0  43.3  65.9  53.8  67.5  77.7  85.5  23.8  51.9 
npair[16]  83.3  89.9  93.9  96.4  43.2  66.1  56.1  69.0  80.2  87.7  24.7  52.4 
ALMN()  85.3  91.4  94.7  97.2  53.1  71.5  63.5  74.2  83.3  90.0  25.7  53.3 
ALMN()  88.8  93.1  95.9  98.1  56.3  75.7  67.0  78.1  86.6  91.3  29.5  56.2 
ALMN()  89.5  93.8  96.3  98.0  56.6  75.9  67.9  79.3  87.0  91.8  30.4  57.2 
ALMN()  90.1  94.0  96.6  98.2  57.0  76.2  68.4  79.9  87.2  92.0  30.7  57.9 
Flowers102: the Flowers102 dataset [24] includes 8,189 flower images from 102 classes, each class consisting of between 40 and 258 images. We use the first 51 classes for training (3,493 images) and the remaining 51 classes for testing (4,696 images). We implement triplet loss [9], lifted structured embedding [5] and N-pair loss [16] with the same network and training configurations as ours and test them with a single crop. From the results in Table 3, our baseline ALMN outperforms these works by adopting a stable anchor point, and ALMN with VPG further improves the performance on both the image clustering and retrieval tasks by learning a discriminative embedding with an adaptive large margin constraint, demonstrating the superiority of our method.
Aircraft: the Aircraft dataset [25] has 100 classes of aircraft with 10,000 images. We use the first 50 classes for training (5,000 images) and the other 50 classes for testing (5,000 images). We again implement triplet loss [9], lifted structured embedding [5] and N-pair loss [16] with the same network and training configurations as ours and test them with a single crop. From the results in Table 3, our baseline ALMN outperforms these works by adopting a stable anchor point, and ALMN with VPG further improves both image retrieval and clustering (F1) performance by learning a discriminative embedding with an adaptive large margin constraint.
Lifted[5]  npair[16]  Clustering[28]  Angular[32]  HDC[29]  BIER[33]  ALMN()  ALMN()  
ensemble  
R@1  62.5  66.4  67  67.9  69.5  72.7  69.3  69.9 
R@10  80.8  83.2  83.6  83.2  84.4  86.5  84.5  84.8 
R@100  91.9  93  93.2  92.2  92.8  94  92.7  92.8 
The Stanford Online Products dataset [5] contains images of online product classes, with only a few images per class on average. Following the zero-shot protocol, we use the first half of the classes for training and the remaining classes for testing. We show our final results in Table 4. One can observe that our method achieves appealing results compared to other single-feature methods and even to ensemble-feature methods (e.g. HDC [29] and BIER [33]; ensembles are well known to outperform single features).
5.4 Cases Study
To show the results of discriminative embedding learning under multimodal scenarios, we provide some cases from the CUB-200-2011 [18] and CARS196 [23] datasets in Figure 8. From the comparison between the top-1 positive and top-1 negative retrievals, it can be observed that the images are correctly retrieved by our algorithm. By introducing the adaptive large margin constraint among classes, ALMN with VPG significantly increases the similarity score between the query and the top-1 positive retrieval, implying that intra-class compactness is strengthened. From the top-1 negative retrieval results, one can observe that it also significantly reduces the similarity score between the query and the top-1 negative sample, demonstrating that our method produces a more separable inter-class distance.
6 Conclusion
In this paper, we propose ALMN to address the problem of discriminative feature learning in a multimodal feature space. It encourages intra-class compactness and inter-class separability by enlarging the angular decision margin among classes, and this margin constraint is locally adaptive. Moreover, the novel concept of VPG enables discriminative embedding learning without hard-example mining, and the question of how best to generate virtual points remains open and may benefit the community. Extensive quantitative and qualitative results demonstrate the effectiveness of our proposed method.
References
[1] Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems. (2012) 1097–1105
[2] Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
[3] Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A.: Going deeper with convolutions. In: Computer Vision and Pattern Recognition. (2014) 1–9
[4] Arandjelovic, R., Gronat, P., Torii, A., Pajdla, T., Sivic, J.: NetVLAD: CNN architecture for weakly supervised place recognition. In: IEEE Conference on Computer Vision and Pattern Recognition. (2016) 5297–5307
[5] Oh Song, H., Xiang, Y., Jegelka, S., Savarese, S.: Deep metric learning via lifted structured feature embedding. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2016) 4004–4012
[6] Hershey, J.R., Chen, Z., Roux, J.L., Watanabe, S.: Deep clustering: Discriminative embeddings for segmentation and separation. In: IEEE International Conference on Acoustics, Speech and Signal Processing. (2015) 31–35
[7] Hoffer, E., Ailon, N.: Deep metric learning using triplet network. In: International Workshop on Similarity-Based Pattern Recognition, Springer (2015) 84–92
[8] Sun, Y., Chen, Y., Wang, X., Tang, X.: Deep learning face representation by joint identification-verification. In: Advances in Neural Information Processing Systems. (2014) 1988–1996
[9] Schroff, F., Kalenichenko, D., Philbin, J.: FaceNet: A unified embedding for face recognition and clustering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2015) 815–823
[10] Parkhi, O.M., Vedaldi, A., Zisserman, A.: Deep face recognition. In: British Machine Vision Conference. (2015) 41.1–41.12
[11] Yi, D., Lei, Z., Liao, S., Li, S.Z.: Deep metric learning for person re-identification. In: International Conference on Pattern Recognition. (2014) 34–39
[12] Tahmoresnezhad, J., Hashemi, S.: Visual domain adaptation via transfer feature learning. Knowledge and Information Systems (2016) 1–21
[13] Long, M., Wang, J., Ding, G., Sun, J.: Transfer joint matching for unsupervised domain adaptation. In: IEEE Conference on Computer Vision and Pattern Recognition. (2014) 1410–1417
[14] Lin, J., Morere, O., Chandrasekhar, V., Veillard, A., Goh, H.: DeepHash: Getting regularization, depth and fine-tuning right. arXiv preprint (2015)
[15] Simo-Serra, E., Trulls, E., Ferraz, L., Kokkinos, I.: Discriminative learning of deep convolutional feature point descriptors. In: IEEE International Conference on Computer Vision. (2015) 118–126
[16] Sohn, K.: Improved deep metric learning with multi-class N-pair loss objective. In: Advances in Neural Information Processing Systems. (2016) 1857–1865
[17] Van Der Maaten, L.: Accelerating t-SNE using tree-based algorithms. Journal of Machine Learning Research 15(1) (2015) 3221–3245
[18] Wah, C., Branson, S., Welinder, P., Perona, P., Belongie, S.: The Caltech-UCSD Birds-200-2011 dataset. California Institute of Technology (2011)
[19] LeCun, Y., Cortes, C.: The MNIST database of handwritten digits. (2010)
[20] Weinberger, K.Q., Saul, L.K.: Distance metric learning for large margin nearest neighbor classification. Journal of Machine Learning Research 10(1) (2006) 207–244
[21] Liu, W., Wen, Y.: Large-margin softmax loss for convolutional neural networks. In: ICML. (2016)
[22] Huang, C., Loy, C.C., Tang, X.: Local similarity-aware deep feature embedding. In: Advances in Neural Information Processing Systems. (2016) 1262–1270
[23] Krause, J., Stark, M., Deng, J., Fei-Fei, L.: 3D object representations for fine-grained categorization. In: IEEE International Conference on Computer Vision Workshops. (2013) 554–561
[24] Nilsback, M.E., Zisserman, A.: Automated flower classification over a large number of classes. In: Proceedings of the Indian Conference on Computer Vision, Graphics and Image Processing. (Dec 2008)
[25] Maji, S., Rahtu, E., Kannala, J., Blaschko, M., Vedaldi, A.: Fine-grained visual classification of aircraft. HAL - INRIA (2013)
[26] Sun, Y., Wang, X., Tang, X.: Deeply learned face representations are sparse, selective, and robust. In: IEEE Conference on Computer Vision and Pattern Recognition. (2014) 2892–2900
[27] Qian, Q., Jin, R., Zhu, S., Lin, Y.: Fine-grained visual categorization via multi-stage metric learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2015) 3716–3724
[28] Song, H.O., Jegelka, S., Rathod, V., Murphy, K.: Deep metric learning via facility location. In: Computer Vision and Pattern Recognition (CVPR). (2017)
[29] Yuan, Y., Yang, K., Zhang, C.: Hard-aware deeply cascaded embedding. In: The IEEE International Conference on Computer Vision (ICCV). (Oct 2017)
[30] Kumar, V.B., Harwood, B., Carneiro, G., Reid, I., Drummond, T.: Smart mining for deep metric learning. arXiv preprint arXiv:1704.01285 (2017)
[31] Wu, C.Y., Manmatha, R., Smola, A.J., Krahenbuhl, P.: Sampling matters in deep embedding learning. In: The IEEE International Conference on Computer Vision (ICCV). (Oct 2017)
[32] Wang, J., Zhou, F., Wen, S., Liu, X., Lin, Y.: Deep metric learning with angular loss. arXiv preprint arXiv:1708.01682 (2017)
[33] Opitz, M., Waltner, G., Possegger, H., Bischof, H.: BIER - boosting independent embeddings robustly. In: The IEEE International Conference on Computer Vision (ICCV). (Oct 2017)
[34] Movshovitz-Attias, Y., Toshev, A., Leung, T.K., Ioffe, S., Singh, S.: No fuss distance metric learning using proxies. In: The IEEE International Conference on Computer Vision (ICCV). (Oct 2017)
[35] Wen, Y., Zhang, K., Li, Z., Qiao, Y.: A discriminative feature learning approach for deep face recognition. In: European Conference on Computer Vision, Springer (2016) 499–515
[36] Chen, B., Deng, W., Du, J.: Noisy softmax: Improving the generalization ability of DCNN via postponing the early softmax saturation. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (July 2017)
[37] Liu, W., Wen, Y., Yu, Z., Li, M., Raj, B., Song, L.: SphereFace: Deep hypersphere embedding for face recognition. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Volume 1. (2017)
[38] Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J.: Caffe: Convolutional architecture for fast feature embedding. arXiv preprint (2014) 675–678