Log In Sign Up

A Decidability-Based Loss Function

by   Pedro Silva, et al.

Nowadays, deep learning is the standard approach for a wide range of problems, including biometrics, such as face recognition and speech recognition, etc. Biometric problems often use deep learning models to extract features from images, also known as embeddings. Moreover, the loss function used during training strongly influences the quality of the generated embeddings. In this work, a loss function based on the decidability index is proposed to improve the quality of embeddings for the verification routine. Our proposal, the D-loss, avoids some Triplet-based loss disadvantages such as the use of hard samples and tricky parameter tuning, which can lead to slow convergence. The proposed approach is compared against the Softmax (cross-entropy), Triplets Soft-Hard, and the Multi Similarity losses in four different benchmarks: MNIST, Fashion-MNIST, CIFAR10 and CASIA-IrisV4. The achieved results show the efficacy of the proposal when compared to other popular metrics in the literature. The D-loss computation, besides being simple, non-parametric and easy to implement, favors both the inter-class and intra-class scenarios.


Curricular SincNet: Towards Robust Deep Speaker Recognition by Emphasizing Hard Samples in Latent Space

Deep learning models have become an increasingly preferred option for bi...

Learning Efficient Representations for Keyword Spotting with Triplet Loss

In the past few years, triplet loss-based metric embeddings have become ...

Margin Matters: Towards More Discriminative Deep Neural Network Embeddings for Speaker Recognition

Recently, speaker embeddings extracted from a speaker discriminative dee...

NPT-Loss: A Metric Loss with Implicit Mining for Face Recognition

Face recognition (FR) using deep convolutional neural networks (DCNNs) h...

σ^2R Loss: a Weighted Loss by Multiplicative Factors using Sigmoidal Functions

In neural networks, the loss function represents the core of the learnin...

Contraction Mapping of Feature Norms for Classifier Learning on the Data with Different Quality

The popular softmax loss and its recent extensions have achieved great s...

Spectrum Translation for Cross-Spectral Ocular Matching

Cross-spectral verification remains a big issue in biometrics, especiall...

1 Introduction

Deep learning disrupted the computer vision field, especially regarding feature extraction and image representation. In this context, convolutional neural networks (CNNs) play a significant role 

(Krizhevsky et al., 2009; LeCun et al., 2015)

. Once deep CNNs are trained on a massive amount of data, it becomes a candidate to be used as a deep feature descriptor, not only in the domain or task in which the CNN was trained but in every computer vision problem, thanks to transfer learning techniques 

(Goodfellow et al., 2016). However, the loss function in which these CNNs were originally pre-trained, strongly influences the extracted deep representation features. The loss function is of paramount importance for training a CNN model and several approaches have been proposed in the literature (Sharif Razavian et al., 2014; Schroff et al., 2015; Parkhi et al., 2015; Wang et al., 2017, 2019a, 2019b). Among them, one very popular is the cross-entropy loss (Goodfellow et al., 2016), also known as softmax-loss since it is often preceded by a softmax operation. Although originally designed for classification problems, it has been very successful in learning deep representative features along with CNN models (Krizhevsky et al., 2009; Sharif Razavian et al., 2014).

The softmax loss is widely used for several reasons, namely it is easy to interpret and implement, fast convergence, and for working well with different batch sizes (large or very small). Even though it is not designed for learning feature representations, the learned features are powerful enough for many tasks, such as face recognition (Parkhi et al., 2015; Wang et al., 2017), ocular recognition (Luz et al., 2018; Silva et al., 2018), among others. However, in tasks in which it is necessary to know how close or how far the samples are concerning each other (such as the biometric verification task) on high-dimensional spaces, this type of loss is not the best option. There are efforts in the literature to adapt the softmax loss (Ranjan et al., 2017; Wang et al., 2017) to the task, but contrastive and triplet-based losses are still the ones that offer the best gains.

Contrastive loss better captures the relationship between two samples projected in a space (Euclidean, for example), by penalizing, during learning, negative samples (from impostors) and rewarding samples from the same (genuine) category (Chopra et al., 2005). Likewise, triplet-based loss explores the concept of similarities and dissimilarities between samples in a space, adding anchor elements. There are many triplet-based losses in the literature, such as triplet-center loss (He et al., 2018), quadruplet loss (Law et al., 2013) and in general, triplet-based loss produce better results, overcoming other pair-wise losses such N-pairs loss (Sohn, 2016), binomial deviance loss (Yi et al., 2014), histogram loss (Ustinova and Lempitsky, 2016) and Multi-Similarity Loss  (Wang et al., 2019b).

Low convergence represents a major problem for triplet-based losses. Besides, given a set of samples, it is not trivial to find positive or negative instances to use as hard pairs, nor is it easy to fine-tune the margin that separates them (Parkhi et al., 2015). Notwithstanding, tripled-based losses are still the most popular losses in the literature, despite their limitations.

In this paper, we propose a new loss function, the D-loss, based on the decidability index (Daugman, 2000). Daugman Daugman (2000) highlights this index as a quality measure for biometric systems, which is often used in the literature for this purpose (De Marsico et al., 2018; Luz et al., 2018). The D-loss assures both inter-class and intra-class separability, and, unlike triplet-loss, avoids the difficult problem of finding hard positive and negatives samples. It also provides better convergence, since it does not require parameter adjustment, and is easy to implement.

The contributions of this paper can be summarized as follows:

  • A new loss function, the D-Loss, based on the decidability index, which is intuitive and easy to compute and implement. The D-loss is also suitable for training models which aim at data representation and, unlike triplets loss, it favors both inter-class separability and intra-class approximation.

  • Under the same conditions, the D-loss overcomes three other popular loss functions (Triplets Soft-hard Loss, Multi-Similarity Loss, and Softmax-Loss) in MNIST-FASHION, CIFAR-10 and presents comparable results for MNIST. Therefore, a competitive loss function.

  • The D-loss has converged well on higher capacity networks such as the networks of the EfficientNet family. Results with D-loss overcome all the three other popular loss functions (Triplets Soft-hard Loss, Multi-Similarity Loss, and Softmax-Loss) evaluated here, for the ocular database CASIA-IrisV4 on the same model (B0) and training conditions.

The manuscript is organized as follows: Section 2 presents a background based on related works. Section 3 describes the methodology, followed by the experiments and discussion in Section 4. Finally, the conclusion appears in Section 5.

2 Background and Related Works

In this section, two categories of loss functions are introduced: one designed for classification tasks and another for verification tasks, usually carried out in a pairwise fashion.

2.1 Classification Losses – Softmax Loss

Given a training set , and their respective labels , where is the number of classes in the problem, the softmax loss function is given by:


in which is the learned feature map, or embedding, is the dimension of the embedding. The and , , are the weights and biases of the last fully layer, respectively (see Figure 1). This formulation enforces features to have larger magnitudes and a radial distribution (Wang et al., 2017). Thus, it does not optimize the features to have small dissimilarity scores for positive pairs or higher dissimilarity scores for the negative ones. Other approaches tried to mitigate such effects on softmax-loss.

Figure 1: Cross-entropy loss applied over a Softmax activation. To turn easier its reference in text, the name “Softmax loss” is used.

2.2 Pair-wise losses

2.2.1 Triplet losses

A triplet is a set of three components: an anchor , a sample of the same class (positive), and one from another class (negative) (Parkhi et al., 2015; Schroff et al., 2015; Wang et al., 2019a). The positive and negative pairs share the same anchor, and the aim is to make the embedding of the positive pair get closer, and the negative to separate according to


in which is a desired margin, and is the squared L2 distance. The triplet loss is given by


for a set of triplets. Nonzero values appear when the inequality (2) does not hold for a given margin . Figure 2 exhibits a sketch of the process, with the green arrow indicating the distance between a positive pair and in violet the distance between a negative.

Figure 2: Triplet loss optimization schema. The same network is applied to all instances.

As in other pair-based losses, sampling is an important step (Wang et al., 2019b). The dataset must have sufficient recordings of the same class to use as positive pairs, otherwise the triplets are not viable. Moreover, randomly chosen and easily satisfies the condition (2) since the euclidean distances between the encodings of images of different people are likely to be large whilst those from the same person tends to be small. The main drawback of the random strategy is low convergence due to uninformative samples with negligible contributions (Wang et al., 2019b).

One solution is to perform hard mining by choosing in a manner that , and with . This gives positive (negative) pairs whose distances to the anchor are the highest (smallest). In this case it is possible that and the margin starts to play an important role.

The hard mining process demands high computational power and may lead to biased positive and negative pictures ranging from mislabeled to poorly imaged samples. Some strategies to overcome these issues are: (i) to generate triplets offline every steps and compute argmin and argmax on a subset of the data; (ii) generate triplets online seeking hard pairs from the mini-batch; (iii) semi-hard negative samples (Schroff et al., 2015).

In some problems, images from the same class may present high dissimilarity when compared to images from different classes due to differences in lighting, color, and pose. The Ranked List Loss (RLL) (Wang et al., 2019a) and the Online Soft Mining (OSM) (Wang et al., 2019) preserve intra-class data distribution instead of shrinking the encodings into a single point.

The Triplet-Center Loss (TCL) (He et al., 2018) replaces the hard negatives with the nearest negative centers from the center loss. Other mining strategies explore only moderate positive pairs (Shi et al., 2016), or consider different depths of similarity with sub-networks (Yuan et al., 2017).

2.2.2 Multi-Similarity Loss

There are other pair-based loss functions, such N-pairs loss (Sohn, 2016), binomial deviance loss (Yi et al., 2014), histogram loss (Ustinova and Lempitsky, 2016) and Multi-Similarity Loss (Wang et al., 2019b). Among them, outstanding results have been reported with Multi-Similarity Loss. Therefore, we consider the Multi-Similarity Loss in this work for comparison purposes..

The Multi-Similarity Loss is based on the pair weighting formulation, which analyzes simultaneously three types of similarities before making a decision: self-similarity, negative relative similarity, and positive relative similarity. The approach consists of a two-step scheme: hard pairs selection (hard mining) and weighting. First, pairs are selected by means of positive relative similarity, then the selected pairs are weighted using both self-similarity and negative relative similarity, inspired by binomial deviance loss (Yi et al., 2014) and lifted structure loss (Oh Song et al., 2016). A framework is proposed to integrate the two steps, the General Pair Weighting (GPW) framework.

3 Methodology

In this section, the decidability metric is formally described along with the training and optimization strategies.

We do not intend to compare our results with state-of-the-art metric learning methods, but rather to evaluate the robustness and discriminative potential of the representations obtained with the D-loss. We are especially interested in assessing D-loss performance in a biometric-like problem.

3.1 Decidability

A typical recognition system, such as biometric recognition, can be analyzed from four different perspectives: (i) False Accept Rate (FAR) in which an impostor is accepted as genuine, (ii) False Reject Rate (FRR) in which an genuine individual is classified as an impostor, (iii) Correct Accept Rate (CAR) in which an genuine individual is accepted, and (iv) Correct Reject Rate (CRR) in which an impostor is correctly not accepted. The FAR and FRR are related to a system error and they are called Type I and Type II error respectively. It is worth mentioning that we discuss the distance function from a point of view of dissimilarity in this work. Figure 

3 shows the relation of the four perspectives when analyzed using the distribution of the scores.

Figure 3: Genuine and Impostor distributions. The Euclidean Distance is used to calculate the dissimilarity between two pairs. If the Euclidean Distance calculated is greater than a criterion, then, the individual is rejected, otherwise, it is accepted.

To illustrate, let be a gallery of samples. After propagating

through a neural network, one obtains

, in which is the embedding function, and is a representation in the embedding space. The dissimilarity score curves (genuine and impostor) use as input. The scores are generated by computing the distance of pairs of samples, on an all against all fashion, as shown in Figure  4. The scores are further mapped into categories (or bins), and a histogram of the frequencies is generated by


where is the frequency count for a specific bin , the number of samples that meets the bin conditions and the score related to the distance between a pair of samples.

Figure 4: Process of construction of a Histogram based on images dissimilarities.

Let and be the two distribution curves shown in Figure 3 regarding genuine and impostor, respectively. The

is the probability density function of the dissimilarity scores, computed from the distance (through some distance metric) between two instances of the same class. Likewise, the

curve represents the probability density of the dissimilarity scores, computed from the distance between instances of different classes.

Thus, the four areas under the two distribution curves, such as illustrated in Figure 3, express the probabilities of each decision metric (FAR, CAR, FRR, CRR).

The overlay of the authentic and the impostor scores is expected in a real scenario. The criterion threshold (C) can be manipulated and the FAR, FRR, CAR, and CRR updated according to users’ needs. By manipulating this criterion and plotting the results in function of CAR and FAR, whose sum should result in one, a Receiver Operating Characteristic (ROC) or Neyman-Pearson curve can be created.

Ideally, the decision curve should be positioned as close as possible to the top left corner of the decision strategy curve in which the Correct Accept is close to 100% and the False Accept (error), close to 0%. However, this does not happen in the real world. To overcome those issues, arises the decidability index.

According to Daugman Daugman (2000)

, the decidability along with the distribution curves are good descriptors for the decision curve and can be better than FAR and FRR to assess pattern recognition systems performance.

The decidability can be defined as a correlation of genuine and impostor scores as defined in


in which and are the means of the two distributions (histogram curve of genuine and impostor scores) and and

are their corresponding standard deviations. The decidability indicates how far the genuine distribution scores curve is from the impostor and the overlap between these two curves (as shown in Figure 


Although the decidability index, as well as the other metrics presented in this subsection, is widely used in the biometric field, one could expand it to most pattern recognition problems. Since the decidability is independent of decision thresholds and quantifies, in a single scalar, the distance and overlap among two distribution curves, one hypothesis is that it might work as a loss function for training models aiming at image/signal representation.

3.2 D-loss

Many computer vision problems use deep learning models for data representation. Several authors use the cross-entropy loss (softmax-loss) to fine-tune models to a new domain or task, and it is also common to train models from scratch. To calculate the loss in a common supervised learning problem, with the cross-entropy loss, one uses the softmax operation to drive the output to probabilities (of each class), and the loss function aims to raise the accuracy on training data, according to a class hot encoded array. Thus, the problem is reduced to a classification problem, in with the main goal is not focused on the generation of better representation for the inputs, but on correctly classifying classes.

The state-of-the-art losses, triple-based ones, employ the concept of the anchor. Given an anchor sample, triplet-based loss aims to learn an embedding space, where positive pairs are forced to be closer to the anchor than the negative, by a margin. The pair-based losses, such as constrative (Chopra et al., 2005), triplet-center (He et al., 2018), and quadruplet (Law et al., 2013), put the focus on the generation of embeddings.

Ideally, those pairs tend to bring more information and enhance the power of the model to represent the data.

The pair selection represents a major issue for that kind of loss, since finding hard positive or negative samples related to anchor is not a trivial task. They also suffer from low convergence, and adjusting the margin parameter is not trivial (Parkhi et al., 2015).

We propose a new loss function, the D-loss. Differently from others, it makes use of all samples in a batch and does not rely on the concept of an anchor. The optimization of Eq. (5) separates and

and reduces the variances

and at the same time, thus improving both intra-class and inter-class scenarios as shown in Figure 5.

Figure 5: D-loss optimization schema.

The decidability index can rise to infinity, and the higher its value, the best is the separation between the impostor and genuine distributions. To better suit a minimization objective, we define D-loss as follows:


with as defined in Eq. (5), and the embedding function based on Euclidean Distance


The is given by , in which is the number of samples, is the training data and is the class of the sample. It is worth highlighting that the computation of the loss considers the entire batch, and the objective is to minimize the .

3.3 Training and Weights Updating

The training process using the D-loss is similar to training a traditional neural network. The main difference relies on the fact that the last layer of the CNN model is the embedding layer instead of the commonly used neurons that represent the classes of the problem. Still, the process of updating the weights is equal to the traditional.

Several batches are created from the training data. Their size is a hyper-parameter of the network and the selection of the data within a batch is random. The weights are updated after the processing of each mini-batch. First, a distance is calculated for each pair within a mini-batch, and, based on those distances, the genuine and impostor distribution curves are computed. Subsequently, the mean and standard deviation of curves are calculated to obtain the decidability. The learning process aims to minimize decidability. The entire process is presented in Algorithm 1.

1 Mini-Batch Setting: The batch size , the number of classes . Input: , the distance function , the embedding function , the learning rate Output: Updated . Step 1: Compute the embeddings by feeding-forward all images . Step 2: Compute the pair-wise distances. foreach  do
2      foreach  do
3           pdist[i, j] = d(, )
5           Step 3: Compute the genuine and impostor scores: foreach  do
6                foreach  do
7                     if  == and !=  then
8                          genuine.push(pdist[i, j])
9                          if  !=  then
10                               impostor.push(pdist[i, j])
Step 4: Compute the decidability (Eq. (5)). Step 5: Compute (Eq. (6)). Step 6: Gradient computation and back-propagation to update the parameters of the network.            and           
Algorithm 1 D-loss on one mini-batch.

4 Experiments

This section presents the details of the experiments, the results, and the implications of the proposed methodology. Source code will be available after acceptance.

4.1 Implementation Details

We implement the CNN functions and loss function with the TensorFlow/Keras Library. The CNN models are trained on a GPU GeForce Titan X with 12GB. In order to perform a fair comparison among losses and avoid experimental flaws such as pointed out by Musgrave

et al. Musgrave et al. (2020), all losses are evaluated under the same conditions, fixing the network architecture, instances in the batches (same seed), optimization algorithm (Adam optimizer, initial learning rate of 10), and same embedding dimension size.

4.2 Datasets

MNIST, Fashion-MNIST, and CIFAR-10 are used as proof of concept, mainly to investigate the convergence of D-loss in different domains. Also, these datasets are common benchmarks in machine learning.

We also employ a popular periocular benchmark to evaluate the proposed loss in a biometric scenario: CASIA-IrisV4. For a fair comparison, a 3-fold classification strategy is evaluated. All 249 individuals are split equally in three groups without overlap. All samples of an individual is just in one fold. To conduct the experiments, we set 30% of the train data for validation.


The MNIST dataset of handwritten digits (LeCun et al., 2010) consists of gray-scale images with a size of 28x28 each. There are ten classes (zero to nine), and each contains 6,000 images for training and 1,000 for testing.


Similar to MNIST, the Fashion-MNIST (Xiao et al., 2017) has gray-scale images of 28x28 resolution, describing fashion items (shoes, coat, etc.). It is composed of 10 classes with the same MNIST distribution over the classes.


The CIFAR-10 (Krizhevsky et al., 2009) is a 10 class problem, with 6,000 images per class. It is a collection of 60,000 colored images, in which 10,000 images are for testing and 50,000 for training. Each image is a RGB of size 32x32.


The CASIA-IrisV4 contains six subsets and is an extension of CASIA-IrisV3 dataset. In this work, the CASIA-Iris-Interval (subset from CASIA-IrisV3) is used to report the results. The iris images are from 249 different subjects with a total of 1,438 JPEG images captured under near infrared illumination.

4.3 Evaluation Metrics

Metrics such as precision, recall, and accuracy are not the most suitable ones for biometric learning problems. Therefore, we use the False Acceptance Rate (FAR), False Rejection Rate (FRR), Equal Error Rate (EER), and Recall@K.

The EER is the point in which the FRR and FAR have the same values on the decision error trade-off curve. The FRR is the rate of incorrect rejections over different thresholds (zero to one), while the FAR is the rate of incorrect acceptances. The EER is the point of intersection of both FAR and FRR.

4.4 CNN Architecture and Model Training

We run our experiments on three different architectures. Two architectures are small networks (Figure 6) and each one matches the input size resolution of MNIST and CIFAR datasets ( and ). The architectures used with the MNIST, Fashion, and CIFAR-10 make use of simple operations that are well understood in the literature. The rationale to use such architectures (with simple convolutional blocks) is to emphasize the impact of the loss function.

An architecture with more capacity is selected for the CASIA-IrisV4 dataset. The experiments are run on the EfficientNet family network (B0 model) (Tan and Le, 2019). The architecture is larger, more sophisticated and has an input size resolution of .

(a) MNIST-architecture proposed in which conv1, conv2 and conv3 have filters size equal to 2x2, with ReLU

activation and padding to avoid reduction after the convolution. All

poolings are with the max function and window size of 2. Dropout of 30%.

(b) CIFAR-architecture proposed in which conv1, conv2 and conv3 have filters size equal to 3x3, with ReLU activation and padding to avoid reduction after the convolution. All poolings are with the max function and window size of 2. Dropout of 20%.

Figure 6: Architectures used during experiments.

All architectures have an embedding layer with dimension 256 (256-feature array output) with L2 normalization. The MNIST-architecture has 100,010 parameters, the CIFAR-architecture 567,082 parameters, and CASIA-V4-architecture 4,377,500 parameters. Before the loss layer, a Lambda Layer with an L2 normalization is used.

With these three architectures, we are able to evaluate the D-loss in different scenarios and from shallow and simple networks to one state-of-the-art network such as EfficientNet B0.

No data augmentation is implemented during training, and the euclidean distance is used to calculate the dissimilarity scores.

We employ Adam optimizer to train the models with a learning rate of , and all images normalized to the range

. The number of epochs is the same for all experiments, and dataset dependent:

epochs for the MNIST, epochs for the Fashion-MNIST, epochs for CIFAR-10 and 1,000 epochs for CASIA-V4.

An initial experiment is performed on the MNIST dataset, to assess the impact of the batch size on both D-loss and Triplet-loss. We evaluate the batch sizes: 300, 400, and 500. As shown in Table 1, the batch size equal to 400 performed better on validation data for the D-loss. For triplet loss, results are very similar, both for batch sizes 300 and 400. Thus, a batch size of 400 is fixed for all losses to maintain the same evaluation setup in all experiments related to MNIST, Fashion and CIFAR-10. For the experiments on the CASIA-V4, a batch size of 150 is employed due to the network architecture size and hardware constraints.



Triplets Soft-hard Loss

EER (%)


EER (%)

300 1.07 7.92 0.70 7.63
400 0.47 9.31 0.74 7.25
500 1.01 8.22 0.83 7.10
Table 1: Batch size investigation on MNIST Dataset. Results on validation set (30% of train set).

4.5 Results and Literature Comparison

Tables 2 and 3 present a comparison between the D-loss, the Triplets soft-hard loss, the Multi Similarity Loss, and the Softmax-loss. The proposed approach outperformed the others in two out of four evaluated scenarios. On the remaining it exhibit comparable performance.

Dataset D-loss MS Loss TS Loss* Softmax
MNIST (10) 1.04 1.06 0.91 1.50
Fashion (10) 5.38 5.82 5.94 7.58
CIFAR-10 13.01 14.11 20.01 14.08
CASIA-V4 7.96 3.43 7.84 1.29 38.55 53.25 9.4 1.92
Table 2: Reported results in terms of Equal Error Rate (EER) in %. MS = Multi-Similarity; TS = Triplets Soft-hard; * = Model did not converge; = No statistical significance.
Dataset R@K D-loss MS Loss TS Loss* Softmax
MNIST (10) 1 0.99 0.98 0.99 0.99
2 0.99 0.99 0.99 1.00
4 0.99 0.99 1.00 1.00
8 0.99 1.00 1.00 1.00
Fashion (10) 1 0.88 0.88 0.89 0.89
2 0.93 0.93 0.94 0.94
4 0.96 0.96 0.96 0.96
8 0.97 0.98 0.97 0.98
CIFAR-10 1 0.77 0.77 0.79 0.74
2 0.84 0.84 0.85 0.82
4 0.89 0.89 0.90 0.88
8 0.92 0.92 0.93 0.92
CASIA-V4 1 0.99 0.01 0.95 0.03 0.66 0.57 0.99 0.01
2 0.99 0.00 0.98 0.01 0.67 0.58 0.99 0.01
4 1.00 0.01 0.99 0.01 0.67 0.58 0.99 0.01
8 1.00 0.01 1.00 0.00 0.67 0.58 1.00 0.00
16 1.00 0.01 1.00 0.00 0.67 0.58 1.00 0.00
32 1.00 0.00 1.00 0.00 0.67 0.58 1.00 0.00
Table 3: Reported results in terms of Recall@K (R@K). MS = Multi-Similarity; TS = Triplets Soft-hard; * = Model did not converged.

All scenarios present training overfitting as well. However, for the simple datasets (MNIST and Fashion-MNIST), a balanced degree of specialization and generalization is observed for all three losses.

4.6 Discussion

The main advantage of the D-loss over the triplets is that it does not require hard samples selection, which is a costly and difficult operation. That operation substantially increases the complexity of triplet-based losses.

A considerable disadvantage of the proposed loss is related to batch size. While the majority of the pair-based distance losses need an anchor and at least a negative sample, the proposed approach depends on a large number of both positive and negative samples to create the similarity (or dissimilarity) score distribution curves for the loss computation.

Regarding memory consumption, all pair-based losses analyzed here are comparable. While the D-loss and the triplets store distances between pairs, the multi-similarity loss stores the multiplication of the embeddings.

The impact of training with D-loss is more apparent in the genuine and impostor distributions curves, as one can see in Figure 7. Figure 7 (a) corresponds to the non-trained model, and Figure 7 (b) is the result after training.

Figure 7: Distribution scores before (a) and after (b) training the model with D-loss on MNIST dataset. More distance between the mean of the curves implies better inter-class discrimination and the reduction in the width of the genuine distribution curve implies intra-class improvement.

5 Conclusion

Based on the decidability index, which is intuitive, easy to compute, and implement, we propose the D-loss as an alternative to triplet-based losses and the softmax-loss. The D-loss function is more suitable than softmax-loss for training models aiming to feature extraction / data representation, much like the Triplet-base loss. Moreover, the D-loss avoids some Triplet-based loss disadvantages, such as the use of hard samples, and it is non-parametric. Also, triplet-based losses have tricky parameter tuning, which can lead to slow convergence. The D-loss drawback is related to memory consumption since it requires a large batch size which can hinder the use of larger models.

The D-loss surpassed three other popular loss functions (Triplets Soft-hard Loss, Multi-Similarity Loss, and Softmax-Loss) in two out of three popular benchmark problems (MNIST-FASHION and CIFAR-10) and presented comparable results on a challenging scenario with the CASIA-IrisV4 dataset.

Conflict of interest statement

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.


  • S. Chopra, R. Hadsell, and Y. LeCun (2005) Learning a similarity metric discriminatively, with application to face verification. In Conference on Computer Vision and Pattern Recognition, Vol. 1, pp. 539–546. Cited by: §1, §3.2.
  • J. Daugman (2000) Biometric decision landscapes. Technical report University of Cambridge, Computer Laboratory. Cited by: §1, §3.1.
  • M. De Marsico, M. Nappi, F. Narducci, and H. Proença (2018) Insights into the results of MICHE I-mobile iris challenge evaluation. Pattern Recognition, pp. 286–304. Cited by: §1.
  • I. Goodfellow, Y. Bengio, and A. Courville (2016) Deep learning. Adaptive Computation and Machine Learning series, MIT Press. Cited by: §1.
  • X. He, Y. Zhou, Z. Zhou, S. Bai, and X. Bai (2018) Triplet-center loss for multi-view 3d object retrieval. In Conference on Computer Vision and Pattern Recognition, pp. 1945–1954. Cited by: §1, §2.2.1, §3.2.
  • A. Krizhevsky, G. Hinton, et al. (2009) Learning multiple layers of features from tiny images. Technical report University of Toronto. Cited by: §1, §4.2.
  • M. T. Law, N. Thome, and M. Cord (2013)

    Quadruplet-wise image similarity learning

    In International Conference on Computer Vision, pp. 249–256. Cited by: §1, §3.2.
  • Y. LeCun, Y. Bengio, and G. Hinton (2015) Deep learning. nature 521 (7553), pp. 436–444. Cited by: §1.
  • Y. LeCun, C. Cortes, and C. Burges (2010) MNIST handwritten digit database. ATT Labs [Online]. Available: 2. Cited by: §4.2.
  • E. Luz, G. Moreira, L. A. Z. Junior, and D. Menotti (2018) Deep periocular representation aiming video surveillance. Pattern Recognition Letters, pp. 2–12. Cited by: §1, §1.
  • K. Musgrave, S. Belongie, and S. Lim (2020) A metric learning reality check. In European Conference on Computer Vision, pp. 681–699. Cited by: §4.1.
  • H. Oh Song, Y. Xiang, S. Jegelka, and S. Savarese (2016) Deep metric learning via lifted structured feature embedding. In conference on computer vision and pattern recognition, pp. 4004–4012. Cited by: §2.2.2.
  • O. M. Parkhi, A. Vedaldi, and A. Zisserman (2015) Deep face recognition. In British Machine Vision Conference (BMVC), pp. 1–12. Cited by: §1, §1, §1, §2.2.1, §3.2.
  • R. Ranjan, C. D. Castillo, and R. Chellappa (2017) L2-constrained softmax loss for discriminative face verification. arXiv preprint arXiv:1703.09507. Cited by: §1.
  • F. Schroff, D. Kalenichenko, and J. Philbin (2015) Facenet: a unified embedding for face recognition and clustering. In conference on computer vision and pattern recognition, pp. 815–823. Cited by: §1, §2.2.1, §2.2.1.
  • A. Sharif Razavian, H. Azizpour, J. Sullivan, and S. Carlsson (2014) CNN features off-the-shelf: an astounding baseline for recognition. In IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 806–813. Cited by: §1.
  • H. Shi, Y. Yang, X. Zhu, S. Liao, Z. Lei, W. Zheng, and S. Z. Li (2016) Embedding deep metric for person re-identification: a study against large variations. In ECCV 2016 - Proceedings, Part I, Lecture Notes in Computer Science, Vol. 9905, pp. 732–748. Cited by: §2.2.1.
  • P. Silva, E. Luz, L. A. Zanlorensi, D. Menotti, and G. Moreira (2018)

    Multimodal feature level fusion based on particle swarm optimization with deep transfer learning


    Congress on Evolutionary Computation (CEC)

    pp. 1–8. Cited by: §1.
  • K. Sohn (2016) Improved deep metric learning with multi-class n-pair loss objective. In Neural Information Processing Systems, pp. 1857–1865. Cited by: §1, §2.2.2.
  • M. Tan and Q. Le (2019) Efficientnet: rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning, pp. 6105–6114. Cited by: §4.4.
  • E. Ustinova and V. Lempitsky (2016) Learning deep embeddings with histogram loss. In Neural Information Processing Systems, pp. 4170–4178. Cited by: §1, §2.2.2.
  • F. Wang, X. Xiang, J. Cheng, and A. L. Yuille (2017) Normface: l2 hypersphere embedding for face verification. In Proceedings of the 25th ACM international conference on Multimedia, pp. 1041–1049. Cited by: §1, §1, §2.1.
  • X. Wang, Y. Hua, E. Kodirov, G. Hu, R. Garnier, and N. M. Robertson (2019a) Ranked list loss for deep metric learning. In Conference on Computer Vision and Pattern Recognition, pp. 5207–5216. Cited by: §1, §2.2.1, §2.2.1.
  • X. Wang, Y. Hua, E. Kodirov, G. Hu, and N. M. Robertson (2019) Deep metric learning by online soft mining and class-aware attention.

    Proceedings of the AAAI Conference on Artificial Intelligence

    33 (01), pp. 5361–5368.
    Cited by: §2.2.1.
  • X. Wang, X. Han, W. Huang, D. Dong, and M. R. Scott (2019b) Multi-similarity loss with general pair weighting for deep metric learning. In Conference on Computer Vision and Pattern Recognition, pp. 5022–5030. Cited by: §1, §1, §2.2.1, §2.2.2.
  • H. Xiao, K. Rasul, and R. Vollgraf (2017) Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. Preprint arXiv:1708.07747. Cited by: §4.2.
  • D. Yi, Z. Lei, S. Liao, and S. Z. Li (2014) Deep metric learning for person re-identification. In International Conference on Pattern Recognition, pp. 34–39. Cited by: §1, §2.2.2, §2.2.2.
  • Y. Yuan, K. Yang, and C. Zhang (2017) Hard-Aware Deeply Cascaded Embedding. International Conference on Computer Vision 2017-October, pp. 814–823. External Links: ISBN 9781538610329, ISSN 15505499 Cited by: §2.2.1.