ALMN: Deep Embedding Learning with Geometrical Virtual Point Generating

06/04/2018 ∙ by Binghui Chen, et al. ∙ 0

Deep embedding learning has become increasingly attractive for discriminative feature learning, but many methods still require hard-class mining, which is computationally expensive and performance-sensitive. To address these issues, we propose the Adaptive Large Margin N-Pair loss (ALMN). Instead of exploring hard-example mining strategies, we introduce a large margin constraint. This constraint encourages a local-adaptive large angular decision margin among dissimilar samples in a multimodal feature space, which in turn promotes intra-class compactness and inter-class separability. It is mainly achieved by a simple yet novel geometrical Virtual Point Generating (VPG) method, which replaces the manual setting of a fixed margin with the automatic generation of a boundary training sample in feature space; how best to generate such points remains an open question. We demonstrate the effectiveness of our method on several popular datasets for image retrieval and clustering tasks.




1 Introduction

With the progress of deep learning [1, 2, 3], deep embedding learning has received a lot of attention and has been applied in a wide range of tasks, including image retrieval and clustering [4, 5, 6, 7], pattern verification [8, 9, 10, 11] and domain adaptation [12, 13]. Deep embedding learning aims to learn a feature representation of the input image that keeps the distance between similar data points small and the distance between dissimilar data points large in the feature space.

In the deep embedding learning community, the most notable works are based on contrastive loss [8, 14, 15, 11] and triplet loss [9, 5, 7, 10]. It is common knowledge that hard example mining is crucial to the quality and efficiency of these methods, since overly easy examples already satisfy the constraint and produce nearly zero loss, contributing nothing to the parameter update during back-propagation. Nevertheless, many hard example mining methods incur a high computational cost when measuring the embedding vectors in feature space, and they are performance-sensitive, e.g. the hard-class mining procedure in N-pair loss [16].


Figure 1: Visualization (by t-SNE [17]) of the deep embedding on the test splits of (a) CUB-200-2011 [18] (5924 images from class 101 to 200) and (b) MNIST [19]. In (a), the intra-class distance can be larger than the inter-class distance, and the distribution is heterogeneous and multimodal. While in (b), the distribution is ’uniform’ and ideal.

To alleviate the issue above and to learn compact intra-class distances and separable inter-class distances, we introduce a large margin constraint into N-pair loss instead of hard-class mining. Some existing works [20, 21] have focused on learning discriminative embeddings by injecting large margin constraints into KNN and softmax, respectively. However, they impose a non-adaptive constraint on the objective loss by introducing a fixed margin, which is not suitable for heterogeneous and multimodal feature distributions.

Figure 1 compares the feature distributions on the fine-grained bird dataset [18] and the MNIST dataset [19]. The diversity of the embedding representation on the bird dataset is prominent: the intra-class distance can be larger than the inter-class distance and the distribution is heterogeneous, unlike the ’uniform’ distribution in MNIST. In real cases, the distribution of the feature space is complex due to pose and appearance [22]. A consequent problem is that a stronger margin constraint can be used for easy patterns but is infeasible for hard patterns (easy/hard patterns are those where the intra-class distance is smaller/larger than the inter-class distance). That is why coarsely imposing a fixed constraint not only makes it hard to improve performance, but may also cause training to fail. Thus, introducing a prudent and local-adaptive margin constraint is essential.

In this paper, we propose the Adaptive Large Margin N-pair loss (ALMN) to address the aforementioned issues, producing discriminative embeddings under heterogeneous feature distributions in multimodal cases. It is mainly achieved by introducing an adaptive margin constraint based on the local structure of the embedding representation. As an extension of N-pair loss [16], our method also optimizes the angular distance between samples, which is rotation-invariant and scale-invariant by nature. Furthermore, the adaptive large margin constraint is constructed by a novel technique, Virtual Point Generating (VPG), which artificially maps a well-learned positive data point to a distant location. By optimizing this virtually generated point, a large angular margin can be obtained. Moreover, the strength of the margin constraint induced by VPG for an individual pattern is adjustable and quantified by a hyper-parameter: the larger its value, the larger the ideal margin between samples. Our ALMN is a flexible learning objective that can be used as a drop-in loss function in end-to-end frameworks and combined with any hard example mining strategy. To the best of our knowledge, this is the first work to introduce a margin constraint by generating virtual data points for deep embedding learning. Virtual point generating is itself an open question; in this work, we simply consider a geometrical approach. Image retrieval and clustering experiments have been performed on several datasets, including CUB-200-2011 [18], CARS196 [23], Flowers102 [24], Aircraft [25] and Stanford Online Products [5].

2 Related Work

The key goal of deep embedding learning is to learn a feature representation that keeps the distance between related data points small and the distance between unrelated data points large in the feature space. Some works jointly optimize contrastive loss and softmax loss for discriminative feature learning, such as DeepID2 [8] and DeepID2+ [26]. FaceNet [9] proposes triplet loss to improve deep embedding learning without joint training with softmax loss, and many notable works use triplet-based objectives to optimize deep frameworks in various tasks [10, 27, 7, 5]. Lifted structure embedding [5] compares the distance of each positive pair against all negative pairs in a mini-batch, aiming to make full use of the mini-batch. To avoid convergence to a bad local optimum, it optimizes a smooth upper bound of nested max functions. Local Similarity-Aware embedding [22] generalizes triplet loss to a quadruplet-like loss and selects hard samples with PDDM units. N-pair loss [16] extends the triplet or quadruplet tuple to an N-pair tuple and applies a softmax cross-entropy loss to the pairwise similarity values in the batch. We share the core idea of N-pair loss, which takes all negative samples in the current mini-batch into consideration, but as an extension, our ALMN can produce more discriminative embeddings even without hard-sample mining, as a consequence of adaptive large margin learning.

The performance of most of the aforementioned works is sensitive to the selected example pairs. Selecting genuinely hard samples to construct a training batch can significantly improve the quality of learning, but it also incurs a high computational cost. In contrast, our ALMN does not require hard-class mining (adopted in the original N-pair loss), and thus allows discriminative embeddings to be trained at a lower computational cost.

Other works also aim at learning discriminative embedding features. Large Margin Nearest Neighbor (LMNN) [20] optimizes the Mahalanobis metric for nearest neighbor classification. More recently, Large Margin Softmax (L-Softmax) [21] encourages an angular decision margin between classes. However, it is designed for Softmax and its margin constraint is the same for all patterns, e.g. a double-angle constraint for both easy and hard patterns; it may therefore be unsuitable for multimodal feature spaces, and the model converges slowly. Our ALMN allows a local-adaptive margin constraint and can be successfully applied in multimodal cases.

Several other works have emerged in the deep embedding learning community. Clustering [28] formulates NMI as the objective function and optimizes it in deep models. HDC [29] employs cascaded models and selects hard samples from different levels and models. Smart-mining [30] combines a local triplet loss and a global loss to optimize the deep metric with hard-sample mining. Sampling-Matters [31] proposes a distance-weighted sampling strategy and uses a much stronger deep model (Res-50) than most existing methods. Angular loss [32] optimizes a triangle-based angular function. BIER loss [33] adopts an ensemble learning framework based on online gradient boosting, which is entirely different from our method, which belongs to the single-feature learning family. Proxy-NCA [34] explains why popular classification losses work from a proxy-agent view, and its implementation is very similar to Softmax. In summary, unlike the above methods, which investigate informative sample mining or feature ensembles, we mainly focus on introducing an open question, i.e. VPG, to impose a large margin constraint and thereby improve the discrimination of deep embedding learning.

3 Adaptive Large Margin N-pair Loss

In deep embedding learning, our goal is to learn a deep feature embedding that maps an input image to a feature vector, such that the similarity between two feature vectors is higher when their images belong to the same class and lower when they belong to different classes. To ensure intra-class compactness and inter-class separability, we introduce a large margin constraint instead of exploring sample-mining strategies. One related work, L-Softmax [21], uses a preset and fixed angular margin constraint to enlarge the margin between classes. In practical vision tasks, however, the embedding distribution is typically multimodal due to pose and appearance [22], so a fixed margin constraint is not suitable. Specifically, a relatively weak constraint contributes little to the optimization of easy patterns, while a rigorous constraint may be too strong to guide the training of hard patterns. In the multimodal setting, injecting an appropriate margin constraint offers a fitting remedy for learning discriminative feature embeddings. Therefore, we propose the Adaptive Large Margin N-pair loss (ALMN), which can meet the needs of multimodal feature distributions. Below, we first review N-pair loss, then introduce our basic objective function, and finally present the mainstay of ALMN, i.e. Virtual Point Generating.

3.1 Review of N-pair Loss and Preliminaries

N-pair loss [16] points out that simultaneously optimizing with multiple negative samples can be regarded as an approximation of ’global optimization’ and thus can improve the performances. It is formulated as follows:


Eq. 1 involves a regularization constant for the norm, the mini-batch size, and the positive point, anchor point and negative points of each tuple. Moreover, minimizing Eq. 1, an inner-product-based softmax-like function, implicitly optimizes the angle between samples, since the inner-product similarity of two vectors can be rewritten as the product of their norms and the cosine of the angle between them. To correctly separate the positive point from the negative points, N-pair loss forces the anchor-positive similarity to exceed each anchor-negative similarity, and this optimization is mainly determined by the angles between the vectors, as verified by L-Softmax [21].
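As an illustration of the review above, the following is a minimal sketch of the commonly used form of N-pair loss, written with the Python standard library only. The function names (`dot`, `n_pair_loss`, `cosine_form`), the argument `reg`, and the way the norm penalty is averaged are assumptions made for this example; the notation of Eq. 1 is not reproduced here. The helper `cosine_form` only demonstrates that an inner product equals the product of the norms and the cosine of the angle.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def n_pair_loss(anchors, positives, reg=0.0):
    """Sketch of N-pair loss for N (anchor, positive) pairs from N classes.

    positives[j] with j != i act as the negatives of anchors[i].
    `reg` weights an L2 penalty on the embeddings (averaging is assumed).
    """
    n = len(anchors)
    total = 0.0
    for i in range(n):
        pos_sim = dot(anchors[i], positives[i])
        # Sum of exp(anchor.negative - anchor.positive) over all negatives.
        neg_sum = sum(
            math.exp(dot(anchors[i], positives[j]) - pos_sim)
            for j in range(n)
            if j != i
        )
        total += math.log1p(neg_sum)
    penalty = sum(dot(v, v) for v in list(anchors) + list(positives))
    return total / n + reg * penalty / n


def cosine_form(u, v):
    """Rewrite the inner product as |u| * |v| * cos(theta)."""
    nu, nv = math.sqrt(dot(u, u)), math.sqrt(dot(v, v))
    cos_theta = max(-1.0, min(1.0, dot(u, v) / (nu * nv)))
    theta = math.acos(cos_theta)
    return nu * nv * math.cos(theta)
```

The loss shrinks as each anchor-positive similarity grows relative to the anchor-negative similarities, which is the separation condition discussed in the text.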

3.2 Basic Objective Function based on Centers

Minimizing Eq. 1 forces the anchor-positive similarity to exceed the anchor-negative similarities so that positive and negative points are correctly separated; in other words, we intend to push the positive point close to the anchor and pull the negative points far from it. Clearly, a reasonable location of the anchor point determines the stability of model training, since the anchor point affects the gradient direction, and an unstable direction impedes stable training. To this end, we adopt the class center instead of a random positive sample as our anchor point. However, it is impractical to update the class centers with respect to the entire training set at each iteration. Following a similar idea to [35], we perform the update on a mini-batch basis. At each iteration, the class centers are updated as follows:


where the indicator function equals 1 if its condition is satisfied and 0 if not, and the step size is the learning rate of the centers. Finally, our basic objective loss is as follows:


3.3 Virtual Point Generating

However, without hard-class mining, the constraint behind Eq. 3 can hardly satisfy our demands of discriminative embedding learning (for simplicity, we consider the binary-class problem here; the multi-class case complicates the analysis but follows the same mechanism), since it is easily satisfied and then stops contributing to the parameter update, as shown in Fig. 2(a), where the decision boundaries of the two classes overlap, yielding separable but not discriminative features. Inspired by L-Softmax [21], where optimizing a more rigorous objective produces more rigorous decision boundaries and a larger decision margin, we propose Virtual Point Generating (VPG) to strengthen the constraint by generating a virtual, locally hard point. The constraint based on the generated point is more suitable for multimodal spaces than that of L-Softmax, because it produces an adaptive decision margin. We first introduce the general concept of VPG and then explain how to make it adaptive. Since the training of Eq. 3 is based on angular optimization, the virtual point is generated in an angular manner and keeps the same amplitude as the original point; we formulate it as follows:


As shown in Fig. 2, the virtual point is obtained by displacing the original point along a direction vector, which affects the location of the virtual point; we do not focus on its specific value here, which will be investigated later. A hyper-parameter further controls the location of the virtual point. From the right chart of Figure 2, it can be observed that the newly generated data point has a larger angular distance to the anchor point than the original point. Therefore, to obtain a more rigorous decision boundary, we instead require


Due to the geometrical relationship in Figure 2, the virtual point always lies at a larger angle from the anchor than the original point, so if the constraint can be satisfied for the virtual point, it holds automatically for the original point. The new objective (i.e. Eq. 5) is therefore a stronger requirement for correct separation and produces more rigorous decision boundaries.
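The geometric idea can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions, not the formulation of Eq. 4 or Eq. 7: the function name `virtual_point`, the parameter names `direction` and `strength`, and the choice to pass the displacement vector in as an argument are all hypothetical. The sketch shifts a positive point along a given direction and then rescales it to its original norm, so that only its angle changes.

```python
import math


def norm(u):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(a * a for a in u))


def angle(u, v):
    """Angle in radians between two non-zero vectors."""
    cos_theta = sum(a * b for a, b in zip(u, v)) / (norm(u) * norm(v))
    return math.acos(max(-1.0, min(1.0, cos_theta)))


def virtual_point(x_pos, direction, strength):
    """Shift x_pos along `direction`, then restore the norm of x_pos.

    `direction` is a hypothetical displacement vector supplied by the
    caller; `strength` plays the role of the margin hyper-parameter.
    With strength == 0 the original point is returned unchanged.
    The shifted point must not be the zero vector.
    """
    shifted = [a + strength * d for a, d in zip(x_pos, direction)]
    scale = norm(x_pos) / norm(shifted)
    return [scale * s for s in shifted]
```

For instance, with an anchor on the x-axis, a positive point at 20 degrees and a direction pointing from the positive point toward a negative point at 80 degrees, increasing `strength` moves the virtual point to a larger angle from the anchor while its norm stays fixed.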

Figure 2: Geometric interpretation of VPG (). The embedding features learned before and after VPG are shown in left chart, one can observe that the angular margin between brown and green classes is enlarged by VPG, since the generated purple point is the boundary example and optimizing it will benefit the discriminative feature learning. The generating process is shown in right chart.

As illustrated in the left chart of Figure 2, optimizing the new objective, which implicitly carries a stronger margin constraint, produces a large angular decision margin between classes and encourages both intra-class compactness and inter-class separability. Specifically, as in Fig. 2(a) before VPG, once the training loss reaches a stable level, the data points in feature space have no need to move further because they already satisfy the constraint well. After VPG, as shown in Fig. 2(b), the positive point is mapped to a boundary point or an even harder point in feature space. To separate this virtual point correctly from the negative point, a new decision boundary is produced, which further pushes the positive point and its virtual counterpart towards the anchor and away from the negative point in an angular manner, yielding more compact intra-class and more separable inter-class angular distributions. Moreover, as can be inferred from Figure 2, increasing the hyper-parameter generates a farther virtual point; in other words, a more rigorous objective is optimized and, in the ideal case, a more discriminative embedding is achieved.

Adaptive Margin: Without loss of generality, we consider . As mentioned above, our goal is to make an adaptive large margin constraint, and from Eq. 4 one can observe that is mainly determined by the vector . Hence, vector should be local-adaptive such that the margin constraint based on is applicable for each case, e.g. hard and easy patterns. Specifically, considering the local feature space, vector should satisfy (as in Figure 2), where is the angle between and its nearest negative vector , and is the angle between and . In summary, since is based on , with considering the local feature structure, the margin constraint introduced by is adaptive, in another word, easy patterns (larger and smaller , i.e. larger ) can be equipped with relatively stronger constraint, and hard patterns (smaller and larger , i.e. smaller ) with weaker constraint. As a consequence, the margin constraint is adaptive.

To generate the virtual point, we need the specific value of the direction vector. However, its exact value is not our focus. In practice, we adopt random sampling instead of hard negative mining and feed only one mini-batch into the network per iteration, so the nearest negative found within a mini-batch is not globally optimal and is usually much farther away. This results in a larger direction vector, i.e. a farther virtual point and a non-local margin constraint, which makes training hard and can even cause it to fail. We address this challenge by empirically constructing a lower bound vector (a vector with the same direction as the original one but a smaller amplitude), as follows:

Proposition 1

is a lower bound vector of as illustrated in Figure 3.


We provide a explicit geometric interpretation of this lower bound vector . As shown in Figure 3.(a), since , and according to the Cosine Law, in , , Additionally, and are on one concentric circle and easy to prove , according to the Sine Law, in , . So always holds and from Eq. 6 we have , thus and, vector and have the same direction with . In conclusion, in Eq. 6 can be regarded as a lower bound vector of .

Figure 3: (a) gives the geometric proof. (b) shows the stable generated by ().

Replacing in Eq. 4 with the lower bound vector , we can obtain a more stable as depicted in Figure 3.(b) and formulate it as follows:


This addresses, to some extent, the less-than-ideal angular constraint caused by random sampling and mini-batch training. We find experimentally that it works well and keeps the network optimization stable.

Overall Objective: to optimize the new rigorous objective , we follow N-pair loss and formulate it as the following one, i.e. our ALMN loss:


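where the virtual point is given by Eq. 7. When the hyper-parameter is zero, the virtual point coincides with the original point, and we take this case as our baseline. ALMN can be easily optimized with the commonly used SGD and back-propagation algorithms. The gradients with respect to the positive point and the anchor point are listed as follows: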

Algorithm 1: Training a deep model with our ALMN

Input: training set, pre-trained CNN model, margin hyper-parameter.

  for each training iteration do
     for each sample in the mini-batch do
        adopt the class center as the anchor point and compute the lower bound vector,
        generate the virtual point from the positive point with Eq. 7,
        compute the loss with Eq. 8 and compute its gradients.
     end for
     update the anchor point with Eq. 2.
  end for

Output: well-trained deep model.

Finally, we summarize ALMN in Algorithm 1. Most notably, we introduce the novel concept of VPG to strengthen the margin constraint, i.e. we generate a virtual boundary point and optimize the constraint on it instead of on the original point. VPG does not prescribe a specific formulation for the virtual point; we leave this as an open question, and there may be other ways to generate it. Here, for the sake of geometrical interpretation, we simply adopt Eq. 4 and Eq. 7.

4 Discussion

The ALMN loss encourages an adaptive large angular margin among classes by a novel constraint constructing method VPG. It has some nice properties:

  • The core of VPG is to strengthen the margin constraint by generating virtual hard points. The overall margin constraint can be controlled by a single hyper-parameter: the larger its value, the larger the ideal margin between classes, yielding more discriminative embeddings.

  • For any fixed value of the hyper-parameter, the angular margin constraint induced by VPG is local-adaptive and varies across instances, since the virtual point is generated on the basis of the local feature structure. Thus, easy patterns are supervised by a stronger constraint, and hard patterns are optimized under a relatively weaker constraint.

  • Our VPG is a generic method that can be easily combined with any hard-sample-mining method and model architecture.

Comparison to N-pair loss: as an extension of N-pair loss [16], our ALMN has two advantages. First, by employing class centers as anchor points instead of random positive points, the optimization of ALMN is more stable than that of N-pair loss thanks to more reliable gradient directions, and the performance of deep embedding learning improves accordingly; this is verified by comparing the ALMN baseline (without VPG) with N-pair loss in Tables 2 and 3. Second, and this is our main contribution, ALMN with VPG significantly encourages a large angular decision margin among classes, yielding more discriminative feature embeddings than N-pair loss; this is mainly achieved by the novel and generic VPG method. Furthermore, ALMN does not require the hard-class mining procedure that N-pair loss uses to construct training batches.

Comparison to other constraint losses: Noisy-Softmax [36] imposes annealed noise on Softmax, aiming to improve the generalization ability of DCNNs. Our ALMN shares the goal of [21, 37], namely enhancing the discriminative power of learned features by imposing a constraint on the objective function. However, in [21] the constraint is specifically designed for the Softmax layer, and the strength of the margin constraint behind the optimization objective is the same for every sample (e.g. m=2); this fixed m-times-angle constraint is not applicable under heterogeneous feature distributions. In contrast, our ALMN targets deep embedding learning: for a given value of the hyper-parameter, the margin constraint is local-adaptive, since the virtual point is generated on the basis of its neighbouring feature space rather than a fixed scale. Moreover, the margin constraint of ALMN is introduced through a generated virtual point, which differs from directly setting the margin as in [21].

Ablation study: to highlight the effectiveness of our local-adaptive large margin constraint, we conduct a contrast test by modifying our basic objective function (Eq. 3) into an L-Softmax-like loss, which has a fixed angular margin constraint, as follows:


where the angular function is defined as in L-Softmax. We then train the same CNN model with Eq. 12 and with Eq. 8, respectively. From Figure 4, we observe that the training loss of the L-Softmax-like objective stops decreasing at a higher level, implying that it does not converge; we infer that the double-angle constraint is too strong for some examples (e.g. hard patterns), which disturbs the overall training process. In contrast, the loss of our ALMN drops quickly to a relatively low level, demonstrating that the local-adaptive angular margin constraint can be optimized well and is indeed crucial for discriminative embedding learning in multimodal cases.

Figure 4: Training loss on CUB-200-2011 dataset.
Figure 5: Mean Recall on Stanford Online Products.
Figure 6: Mean Recall on Aircraft and Flowers.
Figure 7: Mean Recall on Cub and Cars.

5 Experiments and Results

To demonstrate the effectiveness of our proposed ALMN under multimodal scenarios, we evaluate it on image clustering and retrieval tasks over several benchmark datasets, which exhibit large variations, e.g. in pose and appearance. Notably, apart from class labels, we do not use any other annotation such as bounding boxes or part annotations.

5.1 Implementation Details

For network configuration, we use the ImageNet-pretrained GoogLeNet [3] for initialization and fine-tune it on our target datasets. The last fully connected layer is initialized with random weights, and we keep the embedding feature size fixed throughout all of our experiments (since performance does not change much when varying the embedding size, according to [5]). Dropout is applied with a fixed ratio. For a fair comparison, we follow the same data preprocessing as [5], i.e. all training and testing images are resized to a common resolution and then mean subtraction is performed. For data augmentation, all training images are randomly cropped and randomly mirrored. All of our experiments are implemented with the Caffe library [38] with our own modifications.

As mentioned in the previous section, we do not perform hard-class mining. Instead, we construct each random batch from a number of classes and a number of samples per class (written below as classes x samples, e.g. 26 x 5). Note that both the classes and the samples are selected at random. We investigate the effects of different combinations of these two numbers in the following subsection.
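The random batch construction described above could be sketched as follows. This is an assumption-laden illustration rather than the authors' Caffe data layer: the function name `sample_batch`, its parameters, and the decision to skip classes that have too few images are all hypothetical choices made for the example.

```python
import random
from collections import defaultdict


def sample_batch(labels, num_classes, samples_per_class, rng=random):
    """Return indices of a random (classes x samples) mini-batch.

    labels[i] is the class of image i. Classes with fewer than
    `samples_per_class` images are skipped (an assumption of this sketch).
    No embeddings are evaluated, so the selection itself is cheap.
    """
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    eligible = [c for c, idxs in by_class.items() if len(idxs) >= samples_per_class]
    chosen_classes = rng.sample(eligible, num_classes)
    batch = []
    for c in chosen_classes:
        # Draw distinct images of this class without replacement.
        batch.extend(rng.sample(by_class[c], samples_per_class))
    return batch
```

Passing a seeded `random.Random` instance as `rng` makes the batches reproducible, e.g. `sample_batch(labels, 26, 5, random.Random(0))`.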

Training: The learning rate starts from an initial value and is multiplied by a decay factor at a fixed iteration. The total number of iterations differs between (CUB, Flowers, Aircraft) and CARS196. We use weight decay and momentum, together with the regularization constant for the norm, and we use a 10 times larger learning rate for the feature layer.

Evaluation: As in many other works [5, 16, 22], we use the F1 and NMI metrics for the image clustering task and the Recall@K metric for the image retrieval task. We use simple cosine distance to evaluate the embedding features. We take ALMN trained without VPG as our baseline. For comparison, we evaluate many existing methods and re-implement some of them with the same network and training configuration as ours, including triplet loss [9], lifted structured embedding [5] and N-pair loss [16].
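For readers unfamiliar with the retrieval metric, a minimal sketch of Recall@K with cosine similarity is given below, using only the Python standard library. It follows the usual convention that a query counts as a hit if at least one of its K nearest neighbours (excluding itself) shares its class; the function names and the brute-force search are assumptions of this example, not the evaluation code used for the reported numbers.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)


def recall_at_k(embeddings, labels, k):
    """Fraction of queries with a same-class item among their k nearest.

    Every item is used as a query against all other items.
    """
    n = len(embeddings)
    hits = 0
    for i in range(n):
        sims = [
            (cosine(embeddings[i], embeddings[j]), j)
            for j in range(n)
            if j != i
        ]
        sims.sort(reverse=True)  # most similar first
        if any(labels[j] == labels[i] for _, j in sims[:k]):
            hits += 1
    return hits / n
```

The brute-force loop is quadratic in the number of test images, which is acceptable for a sketch but would normally be replaced by a batched matrix product.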

5.2 Component Analysis

Mini-batch combination: To obtain a stable location for the anchor point, we employ the class center. However, we found experimentally that the composition of the mini-batch is important for the update of the class centers.

Inspired by N-pair loss, we construct each mini-batch from a number of classes and a number of samples per class. Throughout our experiments the total batch size is kept almost fixed. As the number of samples per class increases, more positive samples contribute to the update of each class center at the same time, resulting in a more stable and more faithful class center. However, when the number of samples per class is so large that the batch contains a single class, i.e. in the negative-sample limit, there is no contribution from negative samples and inter-class separability is no longer guaranteed.
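To make the role of the samples per class concrete, the sketch below shows one possible mini-batch center update. It assumes a center-loss-style rule (each center moves toward the mean of its samples in the batch, damped by a learning rate); this rule, the function name `update_centers` and its parameters are assumptions for illustration, and Eq. 2 may differ in detail.

```python
def update_centers(centers, embeddings, labels, lr):
    """Return updated class centers after one mini-batch.

    centers: dict mapping class label -> center vector (list of floats).
    Assumed rule: delta = sum(center - x) / (1 + count) over the batch
    samples x of that class, then center <- center - lr * delta.
    Classes absent from the batch keep their center (delta is zero).
    """
    updated = {}
    for c, center in centers.items():
        members = [x for x, y in zip(embeddings, labels) if y == c]
        count = len(members)
        delta = [
            sum(center[d] - x[d] for x in members) / (1 + count)
            for d in range(len(center))
        ]
        updated[c] = [center[d] - lr * delta[d] for d in range(len(center))]
    return updated
```

With more samples of a class in the batch, `delta` averages over more points, which is the stabilizing effect described above.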

Dataset    | CUB-200-2011                      | Cars196
Batch      | 65 x 2 | 26 x 5 | 16 x 8 | 8 x 16 | 65 x 2 | 26 x 5 | 16 x 8 | 8 x 16
Recall@K=1 | 51.1   | 52.4   | 52.1   | 51.1   | 64.2   | 71.6   | 69.7   | 68.8
Recall@K=2 | 63.7   | 64.8   | 64.4   | 64.0   | 75.2   | 81.3   | 80.5   | 79.1
Recall@K=4 | 74.5   | 75.4   | 75.6   | 74.6   | 83.7   | 88.2   | 88.3   | 86.4
Recall@K=8 | 83.6   | 84.3   | 84.3   | 84.0   | 90.0   | 93.4   | 92.8   | 92.3
F1         | 27.2   | 28.5   | 27.5   | 28     | 24.6   | 29.4   | 26.9   | 25.3
NMI        | 59.7   | 60.7   | 59.6   | 60.3   | 57.9   | 62.0   | 60.9   | 58.6
Table 1: F1, NMI, and recall@K scores (%) on CUB-200-2011 [18] and CARS196 [23] datasets with different mini-batch combinations (number of classes x samples per class).

We evaluate the ALMN loss with different combinations of the number of classes and the number of samples per class on CUB-200-2011 [18] and CARS196 [23]; the results are listed in Table 1. Performance varies across combinations even though the total batch sizes are almost the same. As analyzed above, an appropriate combination is required for stable training and discriminative embedding learning. We use the 26 x 5 combination (the best-performing setting in Table 1 on most metrics) in the following subsections. Notably, although the mini-batch is constructed according to this protocol, the selection is entirely random and incurs no extra computational cost, since there is no need to evaluate the embedding vectors in the deep learning framework, unlike hard-class mining in N-pair loss.

Enlarging angular margin: We can further strengthen the angular margin constraint by increasing the hyper-parameter, so that a larger decision margin among classes is produced and a more discriminative embedding is achieved. In Table 2, in the zero-constraint limit, our baseline obtains relatively lower results. Once VPG is enabled (second ALMN row of Table 2), R@1 rises from 50.4 to 52.0 on CUB and from 66.2 to 70.4 on CARS, verifying the effectiveness of the adaptive large margin constraint induced by VPG. Increasing the hyper-parameter further improves performance on all datasets, supporting our initial thought that a larger decision margin among classes encourages the learning of discriminative embeddings. Similar improvements can be found on the other datasets in Tables 3 and 4.

5.3 Comparison with State-of-the-art

Dataset        | CUB-200-2011                      | Cars196
Method         | R@1  | R@2  | R@4  | R@8  | F1   | NMI  | R@1  | R@2  | R@4  | R@8  | F1   | NMI
GoogLeNet[3]   | 40.8 | 53.8 | 67.0 | 78.2 | 18.0 | 51.5 | 35.5 | 47.5 | 58.9 | 71.5 | 8.6  | 37.1
Triplet[9]     | 36.1 | 48.6 | 59.3 | 70.0 | 15.1 | 49.8 | 39.1 | 50.4 | 63.3 | 74.5 | 16.8 | 51.4
Lifted[5]      | 47.2 | 58.9 | 70.2 | 80.2 | 21.2 | 55.6 | 49.0 | 60.3 | 72.1 | 81.5 | 21.8 | 55.0
Clustering[28] | 48.2 | 61.4 | 71.8 | 81.9 | -    | 59.2 | 58.1 | 70.6 | 80.3 | 87.8 | -    | 59.4
S-mining[30]   | 49.8 | 62.3 | 74.1 | 83.3 | -    | 59.9 | 64.7 | 76.2 | 84.2 | 90.2 | -    | 59.5
Angular[32]    | 53.6 | 65.0 | 75.3 | 83.7 | 30.2 | 61.0 | 71.3 | 80.7 | 87.0 | 91.8 | 31.8 | 62.4
N-pair[16]     | 49.1 | 61.2 | 72.7 | 82.1 | 25.9 | 58.5 | 63.6 | 74.7 | 84.1 | 90.1 | 23.9 | 57.4
Proxy NCA[34]  | 49.2 | 61.9 | 67.9 | 72.4 | -    | 59.5 | 73.2 | 82.4 | 86.4 | 88.7 | -    | 64.9
ALMN ()        | 50.4 | 62.7 | 73.5 | 82.9 | 27.6 | 59.4 | 66.2 | 76.7 | 85.1 | 91.4 | 23.6 | 56.7
ALMN ()        | 52.0 | 64.5 | 74.8 | 83.7 | 28.2 | 60.2 | 70.4 | 80.4 | 87.3 | 92.5 | 26.3 | 59.3
ALMN ()        | 52.2 | 64.7 | 75.3 | 84.2 | 28.2 | 60.7 | 71.3 | 81.2 | 88.1 | 93.1 | 28.3 | 61.5
ALMN ()        | 52.4 | 64.8 | 75.4 | 84.3 | 28.5 | 60.7 | 71.6 | 81.3 | 88.2 | 93.4 | 29.4 | 62.0
Table 2: Image clustering and retrieval results (%) on CUB [18] and Cars196 [23]. Results of some compared methods come from our re-implementation. Our best results are bold-faced.

CUB-200-2011: The CUB-200-2011 dataset [18] includes 11,788 bird images from 200 classes. We use the first 100 classes for training (5,864 images) and the remaining 100 classes for testing (5,924 images). We list our experimental results together with those of other state-of-the-art methods in Table 2. The results show that our baseline ALMN outperforms N-pair loss even without the large margin constraint, demonstrating that a reasonable location of the anchor point not only makes training stable but also improves performance. By introducing an adaptive large angular margin constraint among classes, ALMN with VPG significantly improves performance, outperforms most existing methods and achieves results comparable to the state of the art, verifying the effectiveness of our adaptive large margin constraint.

CARS196: The CARS196 dataset [23] includes 16,185 car images from 196 classes. We use the first 98 classes for training (8,054 images) and the remaining 98 classes for testing (8,131 images). We list our experimental results together with those of other state-of-the-art methods in Table 2. The results show that the baseline ALMN performs better than N-pair loss, demonstrating the benefit of our choice of anchor point. ALMN with VPG then substantially improves R@1 over the baseline ALMN, outperforms most other existing methods and obtains results comparable to the state of the art, verifying the effectiveness of our method.

Method        Flowers102                            Aircraft
              R@1   R@2   R@4   R@8   F1    NMI     R@1   R@2   R@4   R@8   F1    NMI
Googlenet[3]  80.5  87.6  92.9  95.7  41.0  63.8    42.0  52.8  64.2  75.6  10.3  30.0
Triplet[9]    80.3  87.2  92.0  95.7  41.3  64.0    41.8  53.5  64.4  75.3  10.7  31.3
Lifted[5]     82.6  89.4  93.1  96.0  43.3  65.9    53.8  67.5  77.7  85.5  23.8  51.9
n-pair[16]    83.3  89.9  93.9  96.4  43.2  66.1    56.1  69.0  80.2  87.7  24.7  52.4
ALMN()        85.3  91.4  94.7  97.2  53.1  71.5    63.5  74.2  83.3  90.0  25.7  53.3
ALMN()        88.8  93.1  95.9  98.1  56.3  75.7    67.0  78.1  86.6  91.3  29.5  56.2
ALMN()        89.5  93.8  96.3  98.0  56.6  75.9    67.9  79.3  87.0  91.8  30.4  57.2
ALMN()        90.1  94.0  96.6  98.2  57.0  76.2    68.4  79.9  87.2  92.0  30.7  57.9
Table 3: Image clustering and retrieval results on the Flowers102 [24] and Aircraft [25] datasets. refers to our re-implementation. Our best results are bold-faced.

Flowers102 The Flowers102 dataset [24] includes 8189 flower images from 102 classes. Each class consists of between 40 and 258 images. We use the first 51 classes for training (3493 images) and the remaining 51 classes for testing (4696 images). We implement triplet loss [9], lifted structured embedding [5] and n-pair loss [16] with the same network and training configurations as ours and test them with a single crop. As shown in Table 3, our baseline ALMN() outperforms the other methods by adopting a stable anchor point. ALMN() further improves performance on the image clustering and retrieval tasks by learning a discriminative embedding with the adaptive large margin constraint, demonstrating the superiority of our method.

Aircraft The Aircraft dataset [25] has 10,000 images of aircraft from 100 classes. We use the first 50 classes for training (5,000 images) and the other 50 classes for testing (5,000 images). We also implement triplet loss [9], lifted structured embedding [5] and n-pair loss [16] with the same network and training configurations as ours and test them with a single crop. As shown in Table 3, our baseline ALMN() outperforms the other methods by adopting a stable anchor point. ALMN() further improves the results on both the image retrieval and clustering (F1) tasks by learning a discriminative embedding with the adaptive large margin constraint.
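The R@K columns in Tables 2 and 3 report Recall@K. As a minimal sketch, assuming the common definition in which a query counts as a hit if at least one of its K nearest neighbours in the test set (excluding the query itself) shares its class, and assuming cosine similarity as the ranking score, the metric can be computed as below. The function names and the toy embeddings are hypothetical and only illustrate the computation; they are not the evaluation code used for the tables.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as sequences of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def recall_at_k(embeddings, labels, k):
    """Fraction of queries with at least one same-class sample in the top k."""
    n = len(embeddings)
    hits = 0
    for q in range(n):
        # Rank every other sample by decreasing similarity to the query.
        ranked = sorted(
            (i for i in range(n) if i != q),
            key=lambda i: cosine(embeddings[q], embeddings[i]),
            reverse=True,
        )
        if any(labels[i] == labels[q] for i in ranked[:k]):
            hits += 1
    return hits / n


# Toy 2-D embeddings (assumed data) with two classes, "A" and "B".
embeddings = [(1.0, 0.0), (0.8, 0.6), (0.6, 0.8), (0.0, 1.0)]
labels = ["A", "B", "A", "B"]
for k in (1, 2, 3):
    print(k, recall_at_k(embeddings, labels, k))
```

In this toy set the classes are interleaved, so Recall@1 is low and the score rises as K grows; a more discriminative embedding would place same-class samples closer together and raise Recall@K at small K.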

Method  Lifted[5]  n-pair[16]  Clustering[28]  Angular[32]  HDC[29]  BIER[33]  ALMN()  ALMN()
R@1     62.5       66.4        67              67.9         69.5     72.7      69.3    69.9
R@10    80.8       83.2        83.6            83.2         84.4     86.5      84.5    84.8
R@100   91.9       93          93.2            92.2         92.8     94        92.7    92.8
Table 4: Results on the Stanford Online Products dataset [5]. Our best results are bold-faced.

Stanford Online Products dataset [5] has images of online classes and each class has images on average. Following the zero-shot protocol, we also split the first classes for training and the remaining classes for testing. We show our final results in Table 4. One can observe that our method () achieves appealing results compared to other single-feature methods and to ensemble-feature methods (e.g. HDC [29] and BIER [33]; ensembles are well known to perform better than a single feature).

The Mean Recall comparisons over these datasets are shown in Figure 7.

5.4 Case Study

To show the results of discriminative embedding learning under the multimodal scenario, we provide some cases from the CUB-200-2011 [18] and Cars196 [23] datasets in Figure 8. Comparing the top-1 positive and top-1 negative retrievals shows that the image is correctly retrieved by our algorithm. By introducing the adaptive large margin constraint among classes, our ALMN () significantly increases the similarity score between the query and the top-1 positive retrieved image, implying that intra-class compactness is strengthened. For the top-1 negative retrievals, our ALMN () significantly reduces the similarity score between the query and the top-1 negative sample, demonstrating that our method produces more separable inter-class distances.

Figure 8: Retrieval cases on the CUB [18] and Cars196 [23] datasets. The query images are shown at the top of the figure. Top-1 positive and top-1 negative images retrieved by our ALMN are marked with red and blue boxes, respectively. The similarity scores obtained with ALMN () and ALMN () are shown, in that order, underneath the images.
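The quantities compared in this case study, namely the similarity score of the top-1 positive and of the top-1 negative retrieval for a query, can be illustrated with the sketch below. It assumes cosine similarity as the score and uses toy 2-D embeddings; the function name `top1_positive_negative` and the data are hypothetical and are not taken from our implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as sequences of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top1_positive_negative(query, query_label, gallery, gallery_labels):
    """Return (score, index) of the most similar same-class gallery sample
    and of the most similar different-class gallery sample.
    Either result is None if the gallery has no such sample."""
    best_pos, best_neg = None, None
    for i, (g, y) in enumerate(zip(gallery, gallery_labels)):
        s = cosine(query, g)
        if y == query_label:
            if best_pos is None or s > best_pos[0]:
                best_pos = (s, i)
        elif best_neg is None or s > best_neg[0]:
            best_neg = (s, i)
    return best_pos, best_neg


# Toy example (assumed data): one query of class "a" and three gallery samples.
query, query_label = (1.0, 0.0), "a"
gallery = [(0.8, 0.6), (0.6, 0.8), (0.0, 1.0)]
gallery_labels = ["a", "b", "b"]
pos, neg = top1_positive_negative(query, query_label, gallery, gallery_labels)
print(pos, neg)
```

A higher top-1 positive score together with a lower top-1 negative score corresponds to the behaviour discussed above: more compact classes and larger separation between classes.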

6 Conclusion

In this paper, we propose ALMN to address the problem of discriminative feature learning in multimodal feature space. It encourages intra-class compactness and inter-class separability by enlarging the angular decision margin among classes, and the margin constraint is local-adaptive. Moreover, the novel concept of VPG enables discriminative embedding learning without hard-example mining, and how best to generate virtual points remains an open question that may benefit the community. Extensive quantitative and qualitative results demonstrate the effectiveness of our proposed method.


References

  • [1] Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: Advances in neural information processing systems. (2012) 1097–1105
  • [2] Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
  • [3] Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A.: Going deeper with convolutions. In: Computer Vision and Pattern Recognition. (2014) 1–9
  • [4] Arandjelovic, R., Gronat, P., Torii, A., Pajdla, T., Sivic, J.: Netvlad: Cnn architecture for weakly supervised place recognition. In: IEEE Conference on Computer Vision and Pattern Recognition. (2016) 5297–5307
  • [5] Oh Song, H., Xiang, Y., Jegelka, S., Savarese, S.: Deep metric learning via lifted structured feature embedding. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2016) 4004–4012
  • [6] Hershey, J.R., Chen, Z., Roux, J.L., Watanabe, S.: Deep clustering: Discriminative embeddings for segmentation and separation. In: IEEE International Conference on Acoustics, Speech and Signal Processing. (2015) 31–35
  • [7] Hoffer, E., Ailon, N.: Deep metric learning using triplet network. In: International Workshop on Similarity-Based Pattern Recognition, Springer (2015) 84–92
  • [8] Sun, Y., Chen, Y., Wang, X., Tang, X.: Deep learning face representation by joint identification-verification. In: Advances in neural information processing systems. (2014) 1988–1996
  • [9] Schroff, F., Kalenichenko, D., Philbin, J.: Facenet: A unified embedding for face recognition and clustering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2015) 815–823
  • [10] Parkhi, O.M., Vedaldi, A., Zisserman, A.: Deep face recognition. In: British Machine Vision Conference. (2015) 41.1–41.12
  • [11] Yi, D., Lei, Z., Liao, S., Li, S.Z.: Deep metric learning for person re-identification. In: International Conference on Pattern Recognition. (2014) 34–39
  • [12] Tahmoresnezhad, J., Hashemi, S.: Visual domain adaptation via transfer feature learning. Knowledge and Information Systems (2016) 1–21
  • [13] Long, M., Wang, J., Ding, G., Sun, J.: Transfer joint matching for unsupervised domain adaptation. In: IEEE Conference on Computer Vision and Pattern Recognition. (2014) 1410–1417
  • [14] Lin, J., Morere, O., Chandrasekhar, V., Veillard, A., Goh, H.: Deephash: Getting regularization, depth and fine-tuning right. Mccarthy (2015)
  • [15] Simo-Serra, E., Trulls, E., Ferraz, L., Kokkinos, I.: Discriminative learning of deep convolutional feature point descriptors. In: IEEE International Conference on Computer Vision. (2015) 118–126
  • [16] Sohn, K.: Improved deep metric learning with multi-class n-pair loss objective. In: Advances in Neural Information Processing Systems. (2016) 1857–1865
  • [17] Van Der Maaten, L.: Accelerating t-sne using tree-based algorithms. Journal of Machine Learning Research 15(1) (2015) 3221–3245
  • [18] Wah, C., Branson, S., Welinder, P., Perona, P., Belongie, S.: The caltech-ucsd birds200-2011 dataset. California Institute of Technology (2011)
  • [19] Lecun, Y., Cortes, C.: The mnist database of handwritten digits.
  • [20] Weinberger, K.Q., Saul, L.K.: Distance metric learning for large margin nearest neighbor classification. Journal of Machine Learning Research 10(1) (2006) 207–244
  • [21] Liu, W., Wen, Y.: Large-margin softmax loss for convolutional neural networks. In: ICML. (2016)
  • [22] Huang, C., Loy, C.C., Tang, X.: Local similarity-aware deep feature embedding. In: Advances in Neural Information Processing Systems. (2016) 1262–1270
  • [23] Krause, J., Stark, M., Deng, J., Fei-Fei, L.: 3d object representations for fine-grained categorization. In: IEEE International Conference on Computer Vision Workshops. (2013) 554–561
  • [24] Nilsback, M.E., Zisserman, A.: Automated flower classification over a large number of classes. In: Proceedings of the Indian Conference on Computer Vision, Graphics and Image Processing. (Dec 2008)
  • [25] Maji, S., Rahtu, E., Kannala, J., Blaschko, M., Vedaldi, A.: Fine-grained visual classification of aircraft. HAL - INRIA (2013)
  • [26] Sun, Y., Wang, X., Tang, X.: Deeply learned face representations are sparse, selective, and robust. In: IEEE Conference on Computer Vision and Pattern Recognition. (2014) 2892–2900
  • [27] Qian, Q., Jin, R., Zhu, S., Lin, Y.: Fine-grained visual categorization via multi-stage metric learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2015) 3716–3724
  • [28] Song, H.O., Jegelka, S., Rathod, V., Murphy, K.: Deep metric learning via facility location. In: Computer Vision and Pattern Recognition (CVPR). (2017)
  • [29] Yuan, Y., Yang, K., Zhang, C.: Hard-aware deeply cascaded embedding. In: The IEEE International Conference on Computer Vision (ICCV). (Oct 2017)
  • [30] Kumar, V.B., Harwood, B., Carneiro, G., Reid, I., Drummond, T.: Smart mining for deep metric learning. arXiv preprint arXiv:1704.01285 (2017)
  • [31] Wu, C.Y., Manmatha, R., Smola, A.J., Krahenbuhl, P.: Sampling matters in deep embedding learning. In: The IEEE International Conference on Computer Vision (ICCV). (Oct 2017)
  • [32] Wang, J., Zhou, F., Wen, S., Liu, X., Lin, Y.: Deep metric learning with angular loss. arXiv preprint arXiv:1708.01682 (2017)
  • [33] Opitz, M., Waltner, G., Possegger, H., Bischof, H.: Bier - boosting independent embeddings robustly. In: The IEEE International Conference on Computer Vision (ICCV). (Oct 2017)
  • [34] Movshovitz-Attias, Y., Toshev, A., Leung, T.K., Ioffe, S., Singh, S.: No fuss distance metric learning using proxies. In: The IEEE International Conference on Computer Vision (ICCV). (Oct 2017)
  • [35] Wen, Y., Zhang, K., Li, Z., Qiao, Y.: A discriminative feature learning approach for deep face recognition. In: European Conference on Computer Vision, Springer (2016) 499–515
  • [36] Chen, B., Deng, W., Du, J.: Noisy softmax: Improving the generalization ability of dcnn via postponing the early softmax saturation. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (July 2017)
  • [37] Liu, W., Wen, Y., Yu, Z., Li, M., Raj, B., Song, L.: Sphereface: Deep hypersphere embedding for face recognition. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Volume 1. (2017)
  • [38] Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J.: Caffe: Convolutional architecture for fast feature embedding. Eprint Arxiv (2014) 675–678