1 Introduction
Deep learning has disrupted the computer vision field, especially regarding feature extraction and image representation. In this context, convolutional neural networks (CNNs) play a significant role (Krizhevsky et al., 2009; LeCun et al., 2015). Once a deep CNN has been trained on a massive amount of data, it becomes a candidate deep feature descriptor, not only in the domain or task for which it was trained but in virtually any computer vision problem, thanks to transfer learning techniques
(Goodfellow et al., 2016). However, the loss function with which these CNNs were originally pretrained strongly influences the extracted deep representation features. The loss function is of paramount importance for training a CNN model, and several approaches have been proposed in the literature (Sharif Razavian et al., 2014; Schroff et al., 2015; Parkhi et al., 2015; Wang et al., 2017, 2019a, 2019b). Among them, a very popular one is the cross-entropy loss (Goodfellow et al., 2016), also known as the softmax loss since it is often preceded by a softmax operation. Although originally designed for classification problems, it has been very successful in learning deep representative features along with CNN models (Krizhevsky et al., 2009; Sharif Razavian et al., 2014). The softmax loss is widely used for several reasons: it is easy to interpret and implement, it converges fast, and it works well with different batch sizes (large or very small). Even though it is not designed for learning feature representations, the learned features are powerful enough for many tasks, such as face recognition (Parkhi et al., 2015; Wang et al., 2017) and ocular recognition (Luz et al., 2018; Silva et al., 2018), among others. However, in tasks in which it is necessary to know how close or how far samples are relative to each other in high-dimensional spaces (such as the biometric verification task), this type of loss is not the best option. There are efforts in the literature to adapt the softmax loss to such tasks (Ranjan et al., 2017; Wang et al., 2017), but contrastive and triplet-based losses are still the ones that offer the best gains.
The contrastive loss better captures the relationship between two samples projected into a space (Euclidean, for example) by penalizing, during learning, negative (impostor) samples and rewarding samples from the same (genuine) category (Chopra et al., 2005). Likewise, triplet-based losses explore the concept of similarities and dissimilarities between samples in a space, adding anchor elements. There are many triplet-based losses in the literature, such as the triplet-center loss (He et al., 2018) and the quadruplet loss (Law et al., 2013), and, in general, triplet-based losses produce better results, overcoming other pairwise losses such as the N-pairs loss (Sohn, 2016), binomial deviance loss (Yi et al., 2014), histogram loss (Ustinova and Lempitsky, 2016), and Multi-Similarity loss (Wang et al., 2019b).
Slow convergence represents a major problem for triplet-based losses. Besides, given a set of samples, it is not trivial to find positive or negative instances to use as hard pairs, nor is it easy to fine-tune the margin that separates them (Parkhi et al., 2015). Notwithstanding these limitations, triplet-based losses remain the most popular losses in the literature.
In this paper, we propose a new loss function, the D-loss, based on the decidability index (Daugman, 2000). Daugman (2000) highlights this index as a quality measure for biometric systems, and it is often used in the literature for this purpose (De Marsico et al., 2018; Luz et al., 2018). The D-loss promotes both inter-class and intra-class separability and, unlike the triplet loss, avoids the difficult problem of mining hard positive and negative samples. It also provides better convergence, since it does not require parameter adjustment, and is easy to implement.
The contributions of this paper can be summarized as follows:

A new loss function, the D-loss, based on the decidability index, which is intuitive and easy to compute and implement. The D-loss is also suitable for training models aimed at data representation and, unlike the triplet loss, it favors both inter-class separability and intra-class approximation.

Under the same conditions, the D-loss overcomes three other popular loss functions (Triplet Soft-hard Loss, Multi-Similarity Loss, and Softmax loss) on Fashion-MNIST and CIFAR-10, and presents comparable results on MNIST. It is, therefore, a competitive loss function.

The D-loss converges well on higher-capacity networks, such as those of the EfficientNet family. Results with the D-loss overcome all three other popular loss functions (Triplet Soft-hard Loss, Multi-Similarity Loss, and Softmax loss) evaluated here on the CASIA-Iris-V4 ocular database, with the same model (B0) and training conditions.
The manuscript is organized as follows: Section 2 presents a background based on related works. Section 3 describes the methodology, followed by the experiments and discussion in Section 4. Finally, the conclusion appears in Section 5.
2 Background and Related Works
In this section, two categories of loss functions are introduced: one designed for classification tasks and another for verification tasks, usually carried out in a pairwise fashion.
2.1 Classification Losses – Softmax Loss
Given a training set X = {x_i}_{i=1}^{N} and their respective labels y_i ∈ {1, ..., C}, where C is the number of classes in the problem, the softmax loss function is given by:

L_S = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{e^{W_{y_i}^{T} f_i + b_{y_i}}}{\sum_{j=1}^{C} e^{W_{j}^{T} f_i + b_j}},    (1)

in which f_i ∈ R^d is the learned feature map, or embedding, of sample x_i, and d is the dimension of the embedding. The W_j and b_j, j = 1, ..., C, are the weights and biases of the last fully connected layer, respectively (see Figure 1). This formulation enforces features to have larger magnitudes and a radial distribution (Wang et al., 2017). Thus, it does not optimize the features to have small dissimilarity scores for positive pairs or high dissimilarity scores for negative ones. Other approaches have tried to mitigate such effects of the softmax loss.
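As a concrete reference, the softmax loss of Eq. (1) can be sketched in a few lines of NumPy. The logits below stand for the W_j^T f_i + b_j terms of the last fully connected layer; the values are illustrative only.

```python
import numpy as np

def softmax_loss(logits, labels):
    """Cross-entropy (softmax) loss of Eq. (1), averaged over the batch.

    logits: (N, C) array of raw scores, i.e. the W_j^T f_i + b_j terms
            of the last fully connected layer.
    labels: (N,) array of integer class labels y_i.
    """
    # Subtract the row-wise maximum before exponentiating, for stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Negative log-likelihood of the correct class, averaged over the batch.
    return -log_probs[np.arange(len(labels)), labels].mean()

# Toy batch: two samples, three classes (values are illustrative).
logits = np.array([[2.0, 0.5, 0.1],
                   [0.2, 3.0, 0.3]])
labels = np.array([0, 1])
loss = softmax_loss(logits, labels)
```

Note that the loss depends only on the score of the correct class relative to the others, which is why it optimizes classification accuracy rather than pairwise distances.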
2.2 Pairwise losses
2.2.1 Triplet losses
A triplet is a set of three components: an anchor x_a, a sample of the same class x_p (positive), and one from another class x_n (negative) (Parkhi et al., 2015; Schroff et al., 2015; Wang et al., 2019a). The positive and negative pairs share the same anchor, and the aim is to bring the embeddings of the positive pair closer together and push those of the negative pair apart, according to

\|f(x_a) - f(x_p)\|_2^2 + \alpha < \|f(x_a) - f(x_n)\|_2^2,    (2)

in which \alpha is a desired margin, and \|\cdot\|_2^2 is the squared L2 distance. The triplet loss is given by

L_T = \sum_{i=1}^{T} \max\left(0, \|f(x_a^i) - f(x_p^i)\|_2^2 - \|f(x_a^i) - f(x_n^i)\|_2^2 + \alpha\right)    (3)

for a set of T triplets. Non-zero values appear when the inequality (2) does not hold for a given margin \alpha. Figure 2 exhibits a sketch of the process, with the green arrow indicating the distance between a positive pair and the violet arrow the distance between a negative pair.
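A minimal NumPy sketch of Eq. (3) may help fix the notation; the margin value below is illustrative, not the one adopted in the experiments.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Triplet loss of Eq. (3) over a batch of triplets.

    Each argument is an (N, d) array of embeddings; `margin` plays the
    role of alpha (0.2 is an illustrative value).
    """
    d_pos = np.sum((anchor - positive) ** 2, axis=1)  # squared L2, positive pair
    d_neg = np.sum((anchor - negative) ** 2, axis=1)  # squared L2, negative pair
    # Hinge term: non-zero only when inequality (2) is violated.
    return np.maximum(0.0, d_pos - d_neg + margin).sum()

anchor   = np.array([[0.0, 0.0]])
positive = np.array([[0.1, 0.0]])
easy_neg = np.array([[1.0, 0.0]])   # far from the anchor: no contribution
hard_neg = np.array([[0.1, 0.1]])   # violates the margin: contributes

easy = triplet_loss(anchor, positive, easy_neg)   # 0.0
hard = triplet_loss(anchor, positive, hard_neg)   # ~0.19
```

The easy negative illustrates the sampling problem discussed next: randomly drawn triplets often satisfy (2) already and contribute nothing to the gradient.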
As in other pair-based losses, sampling is an important step (Wang et al., 2019b). The dataset must have sufficient recordings of the same class to use as positive pairs, otherwise the triplets are not viable. Moreover, randomly chosen x_p and x_n easily satisfy condition (2), since the Euclidean distances between the encodings of images of different people are likely to be large, whilst those from the same person tend to be small. The main drawback of the random strategy is slow convergence due to uninformative samples with negligible contributions (Wang et al., 2019b).
One solution is to perform hard mining by choosing x_p = argmax_p \|f(x_a) - f(x_p)\|_2^2 and x_n = argmin_n \|f(x_a) - f(x_n)\|_2^2. This gives positive (negative) pairs whose distances to the anchor are the highest (smallest). In this case it is possible that \|f(x_a) - f(x_p)\|_2^2 > \|f(x_a) - f(x_n)\|_2^2, and the margin starts to play an important role.
The hard mining process demands high computational power and may lead to biased positive and negative pairs, ranging from mislabeled to poorly imaged samples. Some strategies to overcome these issues are: (i) generating triplets offline every few steps, computing the argmin and argmax on a subset of the data; (ii) generating triplets online, seeking hard pairs within the mini-batch; and (iii) using semi-hard negative samples (Schroff et al., 2015).
In some problems, images from the same class may present high dissimilarity when compared to images from different classes, due to differences in lighting, color, and pose. The Ranked List Loss (RLL) (Wang et al., 2019a) and Online Soft Mining (OSM) (Wang et al., 2019) preserve the intra-class data distribution instead of shrinking the encodings into a single point.
2.2.2 MultiSimilarity Loss
There are other pair-based loss functions, such as the N-pairs loss (Sohn, 2016), binomial deviance loss (Yi et al., 2014), histogram loss (Ustinova and Lempitsky, 2016), and Multi-Similarity loss (Wang et al., 2019b). Among them, outstanding results have been reported with the Multi-Similarity loss; therefore, we consider it in this work for comparison purposes.
The Multi-Similarity loss is based on a pair-weighting formulation, which simultaneously analyzes three types of similarity before making a decision: self-similarity, negative relative similarity, and positive relative similarity. The approach consists of a two-step scheme: hard pair selection (hard mining) and weighting. First, pairs are selected by means of positive relative similarity; then the selected pairs are weighted using both self-similarity and negative relative similarity, inspired by the binomial deviance loss (Yi et al., 2014) and the lifted structure loss (Oh Song et al., 2016). The General Pair Weighting (GPW) framework is proposed to integrate the two steps.
3 Methodology
In this section, the decidability metric is formally described along with the training and optimization strategies.
We do not intend to compare our results with state-of-the-art metric learning methods, but rather to evaluate the robustness and discriminative potential of the representations obtained with the D-loss. We are especially interested in assessing the D-loss performance on a biometric-like problem.
3.1 Decidability
A typical recognition system, such as a biometric system, can be analyzed from four different perspectives: (i) the False Accept Rate (FAR), in which an impostor is accepted as genuine; (ii) the False Reject Rate (FRR), in which a genuine individual is classified as an impostor; (iii) the Correct Accept Rate (CAR), in which a genuine individual is accepted; and (iv) the Correct Reject Rate (CRR), in which an impostor is correctly rejected. The FAR and FRR are related to system errors and are called Type I and Type II errors, respectively. It is worth mentioning that we discuss the distance function from the point of view of dissimilarity in this work. Figure 3 shows the relation of the four perspectives when analyzed using the distribution of the scores.

To illustrate, let X = {x_1, ..., x_N} be a gallery of samples. After propagating X through a neural network, one obtains E = {e_1, ..., e_N}, in which f is the embedding function and e_i = f(x_i) is a representation in the embedding space. The dissimilarity score curves (genuine and impostor) use E as input. The scores are generated by computing the distance between pairs of samples, in an all-against-all fashion, as shown in Figure 4. The scores are further mapped into categories (or bins), and a histogram of the frequencies is generated by

h_b = |\{ s_k : s_k \in b \}|,    (4)

where h_b is the frequency count for a specific bin b, i.e., the number of scores that meet the bin conditions, and s_k is the score related to the distance between a pair of samples.
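The all-against-all score computation described above can be sketched as follows; the toy embeddings and labels are illustrative only.

```python
import numpy as np

def dissimilarity_scores(embeddings, labels):
    """All-against-all Euclidean distances split into genuine and impostor sets.

    embeddings: (N, d) array of representations e_i = f(x_i).
    labels:     (N,) array with the class of each sample.
    Returns (genuine_scores, impostor_scores) as 1-D arrays.
    """
    genuine, impostor = [], []
    n = len(labels)
    for i in range(n):
        for j in range(i + 1, n):  # each unordered pair exactly once
            score = np.linalg.norm(embeddings[i] - embeddings[j])
            (genuine if labels[i] == labels[j] else impostor).append(score)
    return np.array(genuine), np.array(impostor)

# Toy gallery: two samples of class 0, one of class 1 (values illustrative).
emb = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 0.0]])
gen, imp = dissimilarity_scores(emb, np.array([0, 0, 1]))
# A histogram of each set then approximates the two curves of Figure 3:
gen_hist, bin_edges = np.histogram(gen, bins=10, range=(0.0, 6.0))
```

The two histograms (genuine and impostor) are the empirical counterparts of the probability density curves discussed next.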
Let G and I be the two distribution curves shown in Figure 3, regarding genuine and impostor scores, respectively. The G curve is the probability density function of the dissimilarity scores computed from the distance (through some distance metric) between two instances of the same class. Likewise, the I curve represents the probability density of the dissimilarity scores computed from the distance between instances of different classes. Thus, the four areas under the two distribution curves, as illustrated in Figure 3, express the probabilities of each decision metric (FAR, CAR, FRR, CRR).
Some overlap between the genuine and impostor scores is expected in a real scenario. The criterion threshold (C) can be manipulated, and the FAR, FRR, CAR, and CRR updated according to users' needs. By manipulating this criterion and plotting the results as a function of CAR and FAR (note that CAR = 1 - FRR), a Receiver Operating Characteristic (ROC) or Neyman-Pearson curve can be created.
Ideally, the decision curve should be positioned as close as possible to the top-left corner of the decision strategy plot, in which the Correct Accept rate is close to 100% and the False Accept rate (error) is close to 0%. However, this does not happen in the real world, and the decidability index arises to address this issue.
According to Daugman (2000), the decidability, along with the distribution curves, is a good descriptor of the decision curve and can be better than the FAR and FRR to assess the performance of pattern recognition systems.
The decidability can be defined as a measure of separation between the genuine and impostor scores, as defined in

d' = \frac{|\mu_G - \mu_I|}{\sqrt{(\sigma_G^2 + \sigma_I^2)/2}},    (5)

in which \mu_G and \mu_I are the means of the two distributions (the histogram curves of genuine and impostor scores) and \sigma_G and \sigma_I are their corresponding standard deviations. The decidability indicates how far the genuine score distribution curve is from the impostor one, and how much the two curves overlap (as shown in Figure 3).

Although the decidability index, as well as the other metrics presented in this subsection, is widely used in the biometric field, it can be extended to most pattern recognition problems. Since the decidability is independent of decision thresholds and quantifies, in a single scalar, the distance and overlap between two distribution curves, our hypothesis is that it can work as a loss function for training models aimed at image/signal representation.
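Since Eq. (5) involves only the first two moments of each score set, it can be computed in a few lines; the example score values below are illustrative.

```python
import numpy as np

def decidability(genuine, impostor):
    """Daugman's decidability index d' of Eq. (5).

    genuine, impostor: 1-D arrays of dissimilarity scores for same-class
    and different-class pairs, respectively.
    """
    mu_g, mu_i = genuine.mean(), impostor.mean()
    var_g, var_i = genuine.var(), impostor.var()
    return abs(mu_i - mu_g) / np.sqrt((var_g + var_i) / 2.0)

# Well-separated score sets yield a high d'; overlapping sets, a low one.
separated = decidability(np.array([0.1, 0.2, 0.3]), np.array([0.9, 1.0, 1.1]))
overlapping = decidability(np.array([0.1, 0.5, 0.9]), np.array([0.2, 0.6, 1.0]))
```

Being threshold-free, the index summarizes the whole decision landscape in a single scalar, which is what makes it attractive as a training objective.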
3.2 Dloss
Many computer vision problems use deep learning models for data representation. Several authors use the cross-entropy loss (softmax loss) to fine-tune models to a new domain or task, and it is also common to train models from scratch. To compute the loss in a common supervised learning problem with the cross-entropy loss, one uses the softmax operation to turn the outputs into class probabilities, and the loss function aims to raise the accuracy on the training data according to a one-hot-encoded class array. Thus, the problem is reduced to a classification problem, in which the main goal is not to generate better representations of the inputs, but to correctly classify them.
The state-of-the-art losses, the triplet-based ones, employ the concept of an anchor. Given an anchor sample, a triplet-based loss aims to learn an embedding space in which positive pairs are forced to be closer to the anchor than negative ones, by a margin. Pair-based losses, such as the contrastive (Chopra et al., 2005), triplet-center (He et al., 2018), and quadruplet (Law et al., 2013) losses, focus on the generation of embeddings. Ideally, hard pairs bring more information and enhance the power of the model to represent the data. However, pair selection represents a major issue for this kind of loss, since finding hard positive or negative samples relative to the anchor is not a trivial task. These losses also suffer from slow convergence, and adjusting the margin parameter is not trivial (Parkhi et al., 2015).
We propose a new loss function, the D-loss. Differently from the others, it makes use of all samples in a batch and does not rely on the concept of an anchor. The optimization of Eq. (5) separates \mu_G and \mu_I and reduces the variances \sigma_G^2 and \sigma_I^2 at the same time, thus improving both the intra-class and inter-class scenarios, as shown in Figure 5. The decidability index can rise to infinity, and the higher its value, the better the separation between the impostor and genuine distributions. To better suit a minimization objective, we define the D-loss as follows:

L_D = -d',    (6)

with d' as defined in Eq. (5), and the dissimilarity scores computed from the Euclidean distance between embeddings

d(e_i, e_j) = \|e_i - e_j\|_2.    (7)

The batch B is given by B = {(x_1, y_1), ..., (x_n, y_n)}, in which n is the number of samples, x_i is the training data, and y_i is the class of the i-th sample. It is worth highlighting that the computation of the loss considers the entire batch, and the objective is to minimize L_D.
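The batch-level computation can be sketched in NumPy as follows, assuming the D-loss takes the form of the negative decidability, as in Eq. (6); in practice the same arithmetic is expressed with differentiable tensor operations so that gradients can flow back to the embedding network.

```python
import numpy as np

def d_loss(embeddings, labels):
    """Batch-level D-loss: the negative decidability of the pairwise scores.

    A sketch assuming Eq. (6) takes the form L_D = -d'.
    embeddings: (n, d) array for the whole batch; labels: (n,) classes.
    """
    n = len(labels)
    dists, same = [], []
    for i in range(n):
        for j in range(i + 1, n):  # all-against-all within the batch
            dists.append(np.linalg.norm(embeddings[i] - embeddings[j]))
            same.append(labels[i] == labels[j])
    dists, same = np.array(dists), np.array(same)
    genuine, impostor = dists[same], dists[~same]
    d_prime = abs(impostor.mean() - genuine.mean()) / np.sqrt(
        (genuine.var() + impostor.var()) / 2.0)
    return -d_prime

# Tight, well-separated class clusters give a strongly negative loss.
emb = np.array([[0.0, 0.0], [0.0, 0.1], [5.0, 5.0], [5.0, 5.1]])
loss = d_loss(emb, np.array([0, 0, 1, 1]))
```

Note that no anchor or pair mining appears anywhere: every pair in the batch contributes to the two score sets.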
3.3 Training and Weight Updating
The training process using the D-loss is similar to training a traditional neural network. The main difference lies in the fact that the last layer of the CNN model is the embedding layer, instead of the commonly used neurons that represent the classes of the problem. Still, the process of updating the weights is the same as the traditional one.
Several batches are created from the training data. Their size is a hyperparameter of the network, and the selection of the data within a batch is random. The weights are updated after the processing of each mini-batch. First, a distance is calculated for each pair within a mini-batch and, based on those distances, the genuine and impostor distribution curves are computed. Subsequently, the mean and standard deviation of the curves are calculated to obtain the decidability. The learning process maximizes the decidability by minimizing the D-loss. The entire process is presented in Algorithm 1.
4 Experiments
This section presents the details of the experiments, the results, and the implications of the proposed methodology. Source code will be available after acceptance.
4.1 Implementation Details
We implement the CNN models and the loss function with the TensorFlow/Keras library. The CNN models are trained on a GeForce Titan X GPU with 12 GB of memory. In order to perform a fair comparison among losses and avoid experimental flaws such as those pointed out by Musgrave et al. (2020), all losses are evaluated under the same conditions, fixing the network architecture, the instances in the batches (same seed), the optimization algorithm (Adam, with the same initial learning rate), and the embedding dimension.

4.2 Datasets
MNIST, Fashion-MNIST, and CIFAR-10 are used as proof of concept, mainly to investigate the convergence of the D-loss in different domains. These datasets are also common benchmarks in machine learning.
We also employ a popular periocular benchmark, CASIA-Iris-V4, to evaluate the proposed loss in a biometric scenario. For a fair comparison, a 3-fold evaluation strategy is adopted: all 249 individuals are split equally into three groups without overlap, so all samples of an individual belong to exactly one fold. To conduct the experiments, we set aside 30% of the training data for validation.
MNIST:
The MNIST dataset of handwritten digits (LeCun et al., 2010) consists of grayscale images of 28x28 pixels. There are ten classes (zero to nine), each containing 6,000 images for training and 1,000 for testing.
Fashion-MNIST:
Similar to MNIST, Fashion-MNIST (Xiao et al., 2017) contains grayscale images of 28x28 resolution depicting fashion items (shoes, coats, etc.). It comprises 10 classes with the same distribution over classes as MNIST.
CIFAR-10:
CIFAR-10 (Krizhevsky et al., 2009) is a 10-class problem with 6,000 images per class. It is a collection of 60,000 color images, of which 10,000 are for testing and 50,000 for training. Each image is RGB with a size of 32x32.
CASIA-Iris-V4:
CASIA-Iris-V4 contains six subsets and is an extension of the CASIA-Iris-V3 dataset. In this work, the CASIA-Iris-Interval subset (from CASIA-Iris-V3) is used to report the results. The iris images come from 249 different subjects, totaling 1,438 JPEG images captured under near-infrared illumination.
4.3 Evaluation Metrics
Metrics such as precision, recall, and accuracy are not the most suitable ones for biometric learning problems. Therefore, we use the False Acceptance Rate (FAR), False Rejection Rate (FRR), Equal Error Rate (EER), and Recall@K.
The FRR is the rate of incorrect rejections over different thresholds (zero to one), while the FAR is the rate of incorrect acceptances. The EER is the point on the detection error trade-off curve at which the FAR and FRR are equal.
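As a sketch, the EER can be approximated by sweeping a decision threshold over the dissimilarity scores and locating the point where FAR and FRR meet; the grid resolution below is an arbitrary choice, and the score values are illustrative.

```python
import numpy as np

def equal_error_rate(genuine, impostor, n_thresholds=1000):
    """Approximate the EER by sweeping a decision threshold over the scores.

    Scores are dissimilarities, so a pair is accepted when its score falls
    below the threshold: FAR grows and FRR shrinks as the threshold rises.
    """
    top = max(genuine.max(), impostor.max())
    thresholds = np.linspace(0.0, top, n_thresholds)
    far = np.array([(impostor < t).mean() for t in thresholds])  # false accepts
    frr = np.array([(genuine >= t).mean() for t in thresholds])  # false rejects
    idx = np.argmin(np.abs(far - frr))  # point closest to FAR == FRR
    return (far[idx] + frr[idx]) / 2.0

# Perfectly separated scores give an EER of zero.
eer = equal_error_rate(np.array([0.1, 0.2]), np.array([0.8, 0.9]))
```

Production biometric evaluations typically interpolate the FAR/FRR curves rather than taking the nearest grid point, but the threshold-sweep idea is the same.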
4.4 CNN Architecture and Model Training
We run our experiments on three different architectures. Two of them are small networks (Figure 6), each matching the input resolution of the MNIST and CIFAR datasets (28x28 and 32x32, respectively). The architectures used with MNIST, Fashion-MNIST, and CIFAR-10 rely on simple operations that are well understood in the literature. The rationale for using such architectures (with simple convolutional blocks) is to emphasize the impact of the loss function.
An architecture with more capacity is selected for the CASIA-Iris-V4 dataset: the experiments are run with the B0 model of the EfficientNet family (Tan and Le, 2019). This architecture is larger and more sophisticated, with an input resolution of 224x224.
(a) Proposed MNIST architecture, in which conv1, conv2, and conv3 have 2x2 filters, with ReLU activation and padding to avoid spatial reduction after convolution. All pooling layers use the max function with a window size of 2. Dropout of 30%.
(b) Proposed CIFAR architecture, in which conv1, conv2, and conv3 have 3x3 filters, with ReLU activation and padding to avoid spatial reduction after convolution. All pooling layers use the max function with a window size of 2. Dropout of 20%.
All architectures have an embedding layer of dimension 256 (a 256-feature output array) with L2 normalization. The MNIST architecture has 100,010 parameters, the CIFAR architecture 567,082 parameters, and the CASIA-V4 architecture 4,377,500 parameters. Before the loss layer, a Lambda layer with L2 normalization is used.
With these three architectures, we are able to evaluate the D-loss in different scenarios, from shallow and simple networks to a state-of-the-art network such as EfficientNet B0.
No data augmentation is applied during training, and the Euclidean distance is used to calculate the dissimilarity scores.
We employ the Adam optimizer to train the models, with all images normalized before training. The number of epochs is the same across losses for each experiment but dataset dependent, with 1,000 epochs for CASIA-V4.

An initial experiment is performed on the MNIST dataset to assess the impact of the batch size on both the D-loss and the triplet loss. We evaluate batch sizes of 300, 400, and 500. As shown in Table 1, a batch size of 400 performed better on the validation data for the D-loss. For the triplet loss, the results are very similar for batch sizes of 300 and 400. Thus, a batch size of 400 is fixed for all losses to maintain the same evaluation setup in all experiments related to MNIST, Fashion-MNIST, and CIFAR-10. For the experiments on CASIA-V4, a batch size of 150 is employed due to the network architecture size and hardware constraints.
Table 1: EER (%) and decidability (Dec) on MNIST validation data for different batch sizes.

Batch size | D-loss EER (%) | D-loss Dec | Triplets Soft-hard EER (%) | Triplets Soft-hard Dec
300        | 1.07           | 7.92       | 0.70                       | 7.63
400        | 0.47           | 9.31       | 0.74                       | 7.25
500        | 1.01           | 8.22       | 0.83                       | 7.10
4.5 Results and Literature Comparison
Tables 2 and 3 present a comparison between the D-loss, the Triplets Soft-hard loss, the Multi-Similarity loss, and the Softmax loss. The proposed approach outperformed the others in two out of four evaluated scenarios. On the remaining ones, it exhibits comparable performance.
Table 2: EER (%) for each loss function.

Dataset      | D-loss      | MS Loss     | TS Loss*      | Softmax
MNIST (10)   | 1.04        | 1.06        | 0.91          | 1.50
Fashion (10) | 5.38        | 5.82        | 5.94          | 7.58
CIFAR-10     | 13.01       | 14.11       | 20.01         | 14.08
CASIA-V4     | 7.96 ± 3.43 | 7.84 ± 1.29 | 38.55 ± 53.25 | 9.4 ± 1.92
Table 3: Recall@K for each loss function.

Dataset      | R@K | D-loss      | MS Loss     | TS Loss*    | Softmax
MNIST (10)   | 1   | 0.99        | 0.98        | 0.99        | 0.99
             | 2   | 0.99        | 0.99        | 0.99        | 1.00
             | 4   | 0.99        | 0.99        | 1.00        | 1.00
             | 8   | 0.99        | 1.00        | 1.00        | 1.00
Fashion (10) | 1   | 0.88        | 0.88        | 0.89        | 0.89
             | 2   | 0.93        | 0.93        | 0.94        | 0.94
             | 4   | 0.96        | 0.96        | 0.96        | 0.96
             | 8   | 0.97        | 0.98        | 0.97        | 0.98
CIFAR-10     | 1   | 0.77        | 0.77        | 0.79        | 0.74
             | 2   | 0.84        | 0.84        | 0.85        | 0.82
             | 4   | 0.89        | 0.89        | 0.90        | 0.88
             | 8   | 0.92        | 0.92        | 0.93        | 0.92
CASIA-V4     | 1   | 0.99 ± 0.01 | 0.95 ± 0.03 | 0.66 ± 0.57 | 0.99 ± 0.01
             | 2   | 0.99 ± 0.00 | 0.98 ± 0.01 | 0.67 ± 0.58 | 0.99 ± 0.01
             | 4   | 1.00 ± 0.01 | 0.99 ± 0.01 | 0.67 ± 0.58 | 0.99 ± 0.01
             | 8   | 1.00 ± 0.01 | 1.00 ± 0.00 | 0.67 ± 0.58 | 1.00 ± 0.00
             | 16  | 1.00 ± 0.01 | 1.00 ± 0.00 | 0.67 ± 0.58 | 1.00 ± 0.00
             | 32  | 1.00 ± 0.00 | 1.00 ± 0.00 | 0.67 ± 0.58 | 1.00 ± 0.00
All scenarios also present some degree of training overfitting. However, for the simpler datasets (MNIST and Fashion-MNIST), a balanced degree of specialization and generalization is observed for all three losses.
4.6 Discussion
The main advantage of the D-loss over triplet-based losses is that it does not require hard sample selection, which is a costly and difficult operation that substantially increases the complexity of triplet-based losses.
A considerable disadvantage of the proposed loss is related to batch size. While the majority of pair-based distance losses need only an anchor and at least one negative sample, the proposed approach depends on a large number of both positive and negative samples to create the similarity (or dissimilarity) score distribution curves for the loss computation.
Regarding memory consumption, all pair-based losses analyzed here are comparable. While the D-loss and the triplet loss store distances between pairs, the Multi-Similarity loss stores the products of the embeddings.
The impact of training with the D-loss is most apparent in the genuine and impostor distribution curves, as one can see in Figure 7: Figure 7(a) corresponds to the untrained model, and Figure 7(b) shows the result after training.
5 Conclusion
Based on the decidability index, which is intuitive and easy to compute and implement, we propose the D-loss as an alternative to triplet-based losses and the softmax loss. The D-loss is more suitable than the softmax loss for training models aimed at feature extraction / data representation, much like triplet-based losses. Moreover, the D-loss avoids some of the disadvantages of triplet-based losses, such as the need for hard samples, and it is non-parametric, whereas triplet-based losses have tricky parameter tuning that can lead to slow convergence. The drawback of the D-loss is related to memory consumption, since it requires a large batch size, which can hinder the use of larger models.
The D-loss surpassed three other popular loss functions (Triplets Soft-hard Loss, Multi-Similarity Loss, and Softmax loss) on two out of three popular benchmark problems (Fashion-MNIST and CIFAR-10) and presented comparable results in a challenging scenario with the CASIA-Iris-V4 dataset.
Conflict of interest statement
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
References
 Learning a similarity metric discriminatively, with application to face verification. In Conference on Computer Vision and Pattern Recognition, Vol. 1, pp. 539–546. Cited by: §1, §3.2.
 Biometric decision landscapes. Technical report University of Cambridge, Computer Laboratory. Cited by: §1, §3.1.
 Insights into the results of MICHE Imobile iris challenge evaluation. Pattern Recognition, pp. 286–304. Cited by: §1.
 Deep learning. Adaptive Computation and Machine Learning series, MIT Press. Cited by: §1.
 Tripletcenter loss for multiview 3d object retrieval. In Conference on Computer Vision and Pattern Recognition, pp. 1945–1954. Cited by: §1, §2.2.1, §3.2.
 Learning multiple layers of features from tiny images. Technical report University of Toronto. Cited by: §1, §4.2.

Quadruplet-wise image similarity learning. In International Conference on Computer Vision, pp. 249–256. Cited by: §1, §3.2.
Deep learning. Nature 521 (7553), pp. 436–444. Cited by: §1.
 MNIST handwritten digit database. ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist 2. Cited by: §4.2.
 Deep periocular representation aiming video surveillance. Pattern Recognition Letters, pp. 2–12. Cited by: §1, §1.
 A metric learning reality check. In European Conference on Computer Vision, pp. 681–699. Cited by: §4.1.
 Deep metric learning via lifted structured feature embedding. In conference on computer vision and pattern recognition, pp. 4004–4012. Cited by: §2.2.2.
 Deep face recognition. In British Machine Vision Conference (BMVC), pp. 1–12. Cited by: §1, §1, §1, §2.2.1, §3.2.
 L2constrained softmax loss for discriminative face verification. arXiv preprint arXiv:1703.09507. Cited by: §1.
 Facenet: a unified embedding for face recognition and clustering. In conference on computer vision and pattern recognition, pp. 815–823. Cited by: §1, §2.2.1, §2.2.1.
 CNN features offtheshelf: an astounding baseline for recognition. In IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 806–813. Cited by: §1.
 Embedding deep metric for person reidentification: a study against large variations. In ECCV 2016  Proceedings, Part I, Lecture Notes in Computer Science, Vol. 9905, pp. 732–748. Cited by: §2.2.1.

Multimodal feature level fusion based on particle swarm optimization with deep transfer learning. In Congress on Evolutionary Computation (CEC), pp. 1–8. Cited by: §1.
Improved deep metric learning with multi-class N-pair loss objective. In Neural Information Processing Systems, pp. 1857–1865. Cited by: §1, §2.2.2.
 Efficientnet: rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning, pp. 6105–6114. Cited by: §4.4.
 Learning deep embeddings with histogram loss. In Neural Information Processing Systems, pp. 4170–4178. Cited by: §1, §2.2.2.
 Normface: l2 hypersphere embedding for face verification. In Proceedings of the 25th ACM international conference on Multimedia, pp. 1041–1049. Cited by: §1, §1, §2.1.
 Ranked list loss for deep metric learning. In Conference on Computer Vision and Pattern Recognition, pp. 5207–5216. Cited by: §1, §2.2.1, §2.2.1.

Deep metric learning by online soft mining and class-aware attention. Proceedings of the AAAI Conference on Artificial Intelligence 33 (01), pp. 5361–5368. Cited by: §2.2.1.
Multi-similarity loss with general pair weighting for deep metric learning. In Conference on Computer Vision and Pattern Recognition, pp. 5022–5030. Cited by: §1, §1, §2.2.1, §2.2.2.
 FashionMNIST: a novel image dataset for benchmarking machine learning algorithms. Preprint arXiv:1708.07747. Cited by: §4.2.
 Deep metric learning for person reidentification. In International Conference on Pattern Recognition, pp. 34–39. Cited by: §1, §2.2.2, §2.2.2.
 HardAware Deeply Cascaded Embedding. International Conference on Computer Vision 2017October, pp. 814–823. External Links: ISBN 9781538610329, ISSN 15505499 Cited by: §2.2.1.