I Introduction
Siamese neural networks have been found very effective for feature extraction [27], metric learning [19], few-shot learning [18], and feature tracking [13]. A Siamese network includes several, typically two or three, backbone neural networks which share weights [24] (see Fig. 1). Different loss functions have been proposed for training a Siamese network; two commonly used ones are the triplet loss [24] and the contrastive loss [12], which are displayed in Fig. 1.
We generally start by considering two samples, namely an anchor and one of its neighbors, from the same class, and two more samples, namely the same anchor and a distant counterpart, from different classes. The triplet loss considers anchor-neighbor-distant triplets, while the contrastive loss deals with anchor-neighbor and anchor-distant pairs of samples. The main idea of these loss functions is to pull the samples of every class toward one another and push the samples of different classes away from each other, in order to improve the classification results and hence the generalization capability. We introduce these losses in Section II-C in more depth.
The Fisher Discriminant Analysis (FDA) [6] was first proposed in [5]. FDA is a linear method based on the generalized eigenvalue problem [10] that tries to find an embedding subspace which decreases the variance of each class while increasing the variance between the classes. As can be observed, there is a similar intuition behind both the Siamese network and FDA: each tries to embed the data in a way that the samples of every class collapse close together [11] while the classes fall far apart. Although FDA is a well-known statistical method, it has only recently received attention in the deep learning literature [4, 3].
Noticing the similar intuition behind the Siamese network and FDA, we propose two novel loss functions for training Siamese networks, inspired by the theory of FDA. We consider the intra- and inter-class scatters of the triplets instead of their norm distances. The two proposed loss functions are the Fisher Discriminant Triplet (FDT) and Fisher Discriminant Contrastive (FDC) losses, which correspond to the triplet and contrastive losses, respectively. Our experiments show that these loss functions exhibit very promising behavior for training Siamese networks.
The remainder of the paper is organized as follows: Section II reviews the foundations of the Fisher criterion, FDA, the Siamese network, the triplet loss, and the contrastive loss. In Sections III-B and III-C, we propose the FDT and FDC loss functions for training Siamese networks, respectively. In Section IV, we report multiple experiments on different benchmark datasets to demonstrate the effectiveness of the proposed losses. Finally, Section V concludes the paper.
II Background
II-A Fisher Criterion
Assume the data include c classes, where the k-th class, with sample size n_k, is denoted by X_k. Let the dimensionality of the data be d. Consider a p-dimensional subspace (with p ≤ d) onto which the data are projected. We can define the intra-class (within) and inter-class (between) scatters as the scatter of the projected data within and between the classes, respectively. The Fisher criterion is increasing and decreasing with the inter- and intra-class scatters, respectively; hence, by maximizing it, one aims to maximize the inter-class scatter of the projected data while minimizing the intra-class scatter. There exist different versions of the Fisher criterion [7]. Suppose U ∈ R^{d×p} is the projection matrix onto the subspace; then the trace of the matrix U^T S U, for a scatter matrix S, can be interpreted as the variance of the projected data [9]. Based on this interpretation, the most popular Fisher criterion is defined as follows [5, 25]:
(1)  J(U) := tr(U^T S_B U) / tr(U^T S_W U)

where tr(·) denotes the trace of a matrix, and S_B and S_W are the inter- and intra-class scatter matrices, respectively, defined as:

(2)  S_B := Σ_{k=1}^{c} n_k (μ_k − μ)(μ_k − μ)^T

(3)  S_W := Σ_{k=1}^{c} Σ_{i=1}^{n_k} (x_i^{(k)} − μ_k)(x_i^{(k)} − μ_k)^T

where x_i^{(k)} denotes the i-th sample of the k-th class, μ_k is the mean of the k-th class, and μ is the total mean of the data.
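To make the notation concrete, the scatter matrices (2) and (3) and the criterion (1) can be computed as follows. This is a minimal NumPy sketch with illustrative names, not code from the paper:

```python
import numpy as np

def scatter_matrices(X, y):
    """Inter-class (Eq. 2) and intra-class (Eq. 3) scatter matrices.
    X: (n, d) data matrix; y: (n,) integer class labels."""
    mu = X.mean(axis=0)                       # total mean of the data
    d = X.shape[1]
    S_B = np.zeros((d, d))
    S_W = np.zeros((d, d))
    for k in np.unique(y):
        X_k = X[y == k]                       # samples of the k-th class
        mu_k = X_k.mean(axis=0)               # class mean
        diff = (mu_k - mu)[:, None]
        S_B += len(X_k) * (diff @ diff.T)     # Eq. (2)
        C = X_k - mu_k
        S_W += C.T @ C                        # Eq. (3)
    return S_B, S_W

def fisher_criterion(U, S_B, S_W):
    """Eq. (1): tr(U^T S_B U) / tr(U^T S_W U)."""
    return np.trace(U.T @ S_B @ U) / np.trace(U.T @ S_W @ U)
```

For two well-separated classes along the first axis, a projection onto that axis yields a larger criterion value than a projection onto the orthogonal axis, as expected.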
Some other versions of the Fisher criterion are [7]:

(4)  J(U) := tr((U^T S_W U)^{-1} (U^T S_B U))

(5)  J(U) := tr(U^T S_B U) − tr(U^T S_W U)

where the former holds because the solution to maximizing (1) is a generalized eigenvalue problem (see Section II-B) whose solution can be the eigenvectors of S_W^{-1} S_B [8], and the latter holds because (1) is a Rayleigh-Ritz quotient [23] whose denominator can be set to a constant [9]; the Lagrange relaxation of that optimization is similar to (5).

II-B Fisher Discriminant Analysis
FDA is defined as the linear transformation which maximizes the criterion function (1). This criterion is a generalized Rayleigh-Ritz quotient [23], and we may recast the problem to [9]:

(6)  maximize_U  tr(U^T S_B U)
     subject to  U^T S_W U = I

where I is the identity matrix. The Lagrange relaxation of the problem can be written as follows:

(7)  L := tr(U^T S_B U) − tr(Λ^T (U^T S_W U − I))

where Λ is a diagonal matrix which includes the Lagrange multipliers [1]. Setting the derivative of the Lagrangian with respect to U to zero gives:

(8)  S_B U = S_W U Λ

which is a generalized eigenvalue problem, where the columns of U and the diagonal of Λ are the eigenvectors and eigenvalues, respectively [8]. The column space of U is the FDA subspace.
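Assuming S_W is invertible (which holds after the diagonal strengthening discussed later in Section III-D), the generalized eigenvalue problem (8) can be solved via the eigendecomposition of S_W^{-1} S_B; a NumPy sketch with our own naming:

```python
import numpy as np

def fda_subspace(S_B, S_W, p):
    """Solve S_B U = S_W U Λ (Eq. 8) via eig(S_W^{-1} S_B), assuming S_W
    is invertible; return the p eigenvectors with the largest eigenvalues
    as the columns of the projection matrix U."""
    eigvals, eigvecs = np.linalg.eig(np.linalg.inv(S_W) @ S_B)
    order = np.argsort(eigvals.real)[::-1]    # sort eigenvalues descending
    return np.real(eigvecs[:, order[:p]])
```

Since S_W^{-1} S_B is generally not symmetric, `np.linalg.eig` may return tiny imaginary parts, which the sketch discards.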
II-C Siamese Network and Loss Functions
II-C1 Siamese Network
A Siamese network is a set of several (typically two or three) networks which share weights with each other [24] (see Fig. 1). The weights are trained using a loss based on anchor, neighbor (positive), and distant (negative) samples, where the anchor and neighbor belong to the same class, but the anchor and distant samples are in different classes. We denote the anchor, neighbor, and distant samples by x^a, x^n, and x^d, respectively. The loss functions used to train a Siamese network usually make use of these samples, trying to pull the anchor and neighbor towards one another while pushing the anchor and distant samples away from each other. In the following, two different loss functions are introduced for training Siamese networks.
II-C2 Triplet Loss
The triplet loss uses the anchor, neighbor, and distant samples. Let f(x) be the output (i.e., the embedding) of the network for the input x. The triplet loss tries to reduce the distance between the anchor and neighbor embeddings and to increase the distance between the anchor and distant embeddings. Once the anchor-distant distances become larger than the anchor-neighbor distances by a margin m, the desired embedding is obtained. The triplet loss, to be minimized, is defined as [24]:

(9)  L_triplet := (1/b) Σ_{i=1}^{b} [ ||f(x_i^a) − f(x_i^n)||_2^2 − ||f(x_i^a) − f(x_i^d)||_2^2 + m ]_+

where (x_i^a, x_i^n, x_i^d) is the i-th triplet in the mini-batch, b is the mini-batch size, [z]_+ := max(z, 0), and ||·||_2 denotes the ℓ2 norm.
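Eq. (9) can be written directly over a mini-batch of embeddings; a NumPy sketch (the default margin value here is arbitrary):

```python
import numpy as np

def triplet_loss(A, N, D, m=1.0):
    """Eq. (9): mean over the mini-batch of
    [||f(x^a) - f(x^n)||^2 - ||f(x^a) - f(x^d)||^2 + m]_+ .
    A, N, D: (b, p) embeddings of anchors, neighbors, and distants."""
    pos = np.sum((A - N) ** 2, axis=1)   # squared anchor-neighbor distances
    neg = np.sum((A - D) ** 2, axis=1)   # squared anchor-distant distances
    return float(np.mean(np.maximum(pos - neg + m, 0.0)))
```

When the distant embeddings are already far enough, the hinge is inactive and the loss is zero; a fully collapsed embedding pays exactly the margin.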
II-C3 Contrastive Loss
The contrastive loss uses pairs of samples, which can be anchor-neighbor or anchor-distant pairs. If the samples are an anchor and a neighbor, they are pulled towards each other; otherwise, their distance is increased. In other words, the contrastive loss performs like the triplet loss but one pair at a time rather than simultaneously. The desired embedding is obtained when the anchor-distant distances become larger than the anchor-neighbor distances by a margin m. This loss, to be minimized, is defined as [12]:

(10)  L_contrastive := (1/b) Σ_{i=1}^{b} [ (1 − y_i) ||f(x_i^a) − f(x_i)||_2^2 + y_i [m − ||f(x_i^a) − f(x_i)||_2^2]_+ ]

where y_i is zero when the i-th pair (x_i^a, x_i) is an anchor-neighbor pair and one when it is an anchor-distant pair.
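Similarly, Eq. (10) over a mini-batch of pairs (a NumPy sketch; names and the default margin are illustrative):

```python
import numpy as np

def contrastive_loss(A, X, y, m=1.0):
    """Eq. (10): mean over the mini-batch of
    (1 - y_i) ||f(x^a) - f(x)||^2 + y_i [m - ||f(x^a) - f(x)||^2]_+ ,
    where y_i = 0 for anchor-neighbor and y_i = 1 for anchor-distant pairs."""
    d2 = np.sum((A - X) ** 2, axis=1)    # squared pair distances
    return float(np.mean((1 - y) * d2 + y * np.maximum(m - d2, 0.0)))
```

Coincident anchor-neighbor pairs cost nothing, while coincident anchor-distant pairs pay the full margin.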
III The Proposed Loss Functions
III-A Network Structure
Consider any arbitrary neural network as the backbone. This network can be either a multilayer perceptron or a convolutional network. Let q be the number of its output neurons, i.e., the dimensionality of its embedding space. We add a fully connected layer after the q-neuron layer, mapping to a new embedding space (output layer) with p neurons. Denote the weights of this layer by U ∈ R^{q×p}. We name the first, q-dimensional embedding space the latent space and the second, p-dimensional embedding space the feature space. Our proposed loss functions are network-agnostic, as they can be used with any structure and topology of the backbone. The overall network structure for the proposed loss functions is depicted in Fig. 2.
Consider a triplet (x^a, x^n, x^d) or a pair (x^a, x). We feed the triplet or pair to the network. We denote the latent embedding of a sample x by f(x), while its feature embedding is denoted by U^T f(x). The last layer of the network projects the latent embedding onto the feature space; the activation function of the last layer is linear because of unsupervised feature extraction, so the last layer acts as a linear projection. During training, the latent space is adjusted to extract features; the last-layer projection then fine-tunes the latent features in order to obtain better discriminative features. In Section IV-D, we report experiments that demonstrate this.
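The role of the last layer as a linear projection can be sketched as follows (the dimensions and names are illustrative, not the values used in the experiments):

```python
import numpy as np

rng = np.random.default_rng(0)
q, p = 64, 8                      # latent and feature dimensionalities (example sizes)
U = rng.normal(size=(q, p))       # weights of the last, linear layer
f_x = rng.normal(size=(q,))       # latent embedding (stand-in for the backbone output f(x))
feature = U.T @ f_x               # feature embedding: linear projection onto the feature space
assert feature.shape == (p,)
```

Because no nonlinearity follows U, the feature space is an exact linear image of the latent space, which is what lets the FDA machinery below be applied to it.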
III-B Fisher Discriminant Triplet Loss
As the loss function is usually minimized in neural networks, we minimize the negative of the Fisher criterion, using version (5) of the criterion:

(11)  minimize_U  tr(U^T S_W U) − tr(U^T S_B U)

This problem is ill-defined because, by increasing the total scatter of the embedded data, the inter-class scatter also gets larger and this objective function decreases; therefore, the embedding space scales up and gradually explodes in order to increase the term tr(U^T S_B U). In order to control this issue, we penalize the total scatter of the embedded data, whose scatter matrix is denoted by S_T:

(12)  minimize_U  tr(U^T S_W U) − tr(U^T S_B U) + ε tr(U^T S_T U)

where ε > 0 is the regularization parameter. The total scatter can be considered as the summation of the inter- and intra-class scatters [26]:

(13)  S_T = S_B + S_W

Hence, we have:

(14)  tr(U^T S_W U) − tr(U^T S_B U) + ε tr(U^T S_T U)
      =(a) (1 + ε) tr(U^T S_W U) − (1 − ε) tr(U^T S_B U)
      ∝ tr(U^T S_W U) − λ tr(U^T S_B U)

where (a) is because of (13), and λ := (1 − ε)/(1 + ε). It is recommended that λ and ε be close to one and zero, respectively, because the total scatter should be controlled so as not to explode; for example, a small ε, and thus a λ slightly below one, is a reasonable choice.
We want the inter-class scatter term to become larger than the intra-class scatter term by a margin m > 0. Hence, the FDT loss, to be minimized, is defined as:

(15)  L_FDT := [ tr(U^T S_W U) − λ tr(U^T S_B U) + m ]_+

where [z]_+ := max(z, 0); we defer the mathematical definition of the intra- and inter-class scatter matrices in our loss functions to Section III-D.
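Given the scatter matrices (defined in Section III-D) and the projection matrix U of the last layer, the FDT loss reduces to a few trace computations. A NumPy sketch, with illustrative default values for λ and m:

```python
import numpy as np

def fdt_loss(U, S_W, S_B, lam=0.9, m=1.0):
    """Eq. (15): [tr(U^T S_W U) - λ tr(U^T S_B U) + m]_+ ."""
    intra = np.trace(U.T @ S_W @ U)      # intra-class scatter term
    inter = np.trace(U.T @ S_B @ U)      # inter-class scatter term
    return float(max(intra - lam * inter + m, 0.0))
```

When the inter-class term dominates by more than the margin, the hinge is inactive and the loss vanishes.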
III-C Fisher Discriminant Contrastive Loss
Rather than triplets of data, we can consider pairs of samples. For this goal, we propose the FDC loss function, defined as:

(16)  L_FDC := tr(U^T S_W U) + [ m − λ tr(U^T S_B U) ]_+

where the intra- and inter-class scatter matrices, which will be defined in Section III-D, consider the anchor-neighbor and anchor-distant pairs, respectively.
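Under the same assumptions as the FDT sketch, the FDC loss can be sketched as:

```python
import numpy as np

def fdc_loss(U, S_W, S_B, lam=0.9, m=1.0):
    """Eq. (16): tr(U^T S_W U) + [m - λ tr(U^T S_B U)]_+ ; the intra-class
    term is always minimized, while the inter-class term only contributes
    until it exceeds the margin."""
    intra = np.trace(U.T @ S_W @ U)
    inter = np.trace(U.T @ S_B @ U)
    return float(intra + max(m - lam * inter, 0.0))
```

This mirrors how the contrastive loss (10) treats its two pair types one at a time rather than jointly.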
III-D Intra- and Inter-Class Scatters
III-D1 Scatter Matrices in FDT
Let the output embedding of the backbone, i.e., the second-to-last layer of the total structure, be denoted by f(x); we call this embedding the latent embedding. Consider the latent embeddings of the anchor, neighbor, and distant samples, denoted by f(x_i^a), f(x_i^n), and f(x_i^d), respectively, where i indexes the b triplets in the mini-batch. The intra- and inter-class scatter matrices are, respectively, defined as:

(17)  S_W := Σ_{i=1}^{b} (f(x_i^a) − f(x_i^n))(f(x_i^a) − f(x_i^n))^T

(18)  S_B := Σ_{i=1}^{b} (f(x_i^a) − f(x_i^d))(f(x_i^a) − f(x_i^d))^T

The ranks of the intra- and inter-class scatters are at most b. As the subspace of FDA can be interpreted as the eigenspace of S_W^{-1} S_B, the rank of that subspace would also be at most b, because we usually have b < q. In order to improve the rank of the embedding subspace, we slightly strengthen the main diagonal of the scatter matrices [22]:

(19)  S_W := Σ_{i=1}^{b} (f(x_i^a) − f(x_i^n))(f(x_i^a) − f(x_i^n))^T + μ_1 I

(20)  S_B := Σ_{i=1}^{b} (f(x_i^a) − f(x_i^d))(f(x_i^a) − f(x_i^d))^T + μ_2 I

where μ_1 and μ_2 are small positive numbers. Hence, the embedding subspace becomes full rank.
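Eqs. (17)-(20) can be computed from a mini-batch of latent embeddings as follows (a NumPy sketch; the default values of μ_1 and μ_2 are illustrative):

```python
import numpy as np

def fdt_scatters(A, N, D, mu1=1e-4, mu2=1e-4):
    """Eqs. (19)-(20): diagonally strengthened intra- and inter-class
    scatter matrices of a triplet mini-batch.
    A, N, D: (b, q) latent embeddings of anchors, neighbors, distants."""
    W = A - N                            # anchor-neighbor differences
    B = A - D                            # anchor-distant differences
    q = A.shape[1]
    S_W = W.T @ W + mu1 * np.eye(q)      # Eq. (19)
    S_B = B.T @ B + mu2 * np.eye(q)      # Eq. (20)
    return S_W, S_B
```

With b < q, the unstrengthened scatters have rank at most b, while the strengthened ones are full rank, as the text above describes.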
III-D2 Scatter Matrices in FDC
As in the regular contrastive loss, we consider the anchor-neighbor and anchor-distant pairs for the FDC loss. Let y_i be zero when the i-th pair is an anchor-neighbor pair and one when it is an anchor-distant pair, and denote the latent embeddings of this pair by f(x_i^a) and f(x_i). The intra- and inter-class scatter matrices in the FDC loss are, respectively, defined as:

(21)  S_W := Σ_{i=1}^{b} (1 − y_i)(f(x_i^a) − f(x_i))(f(x_i^a) − f(x_i))^T + μ_1 I

(22)  S_B := Σ_{i=1}^{b} y_i (f(x_i^a) − f(x_i))(f(x_i^a) − f(x_i))^T + μ_2 I

where b is the mini-batch size and μ_1 and μ_2 are small positive numbers, as before.
Note that in both the FDT and FDC loss functions, there exist the weight matrix U and the intra- and inter-class scatter matrices. By backpropagation, both the last layer and the previous layers are trained, because U in the loss affects the last layer, while the scatter matrices in the loss impact all the layers.
IV Experiments
IV-A Datasets
For the experiments, we used three public datasets: MNIST and two challenging histopathology datasets. In the following, we introduce these datasets. The MNIST images have one channel, while the histopathology images have three color channels.
MNIST dataset – The MNIST dataset [20] includes 60,000 training images and 10,000 test images of size 28×28 pixels. We created a dataset of 500 triplets from the training data to test the proposed loss functions for a small training sample size.
CRC dataset – The first histopathology dataset is the Colorectal Cancer (CRC) dataset [17]. It contains patches from eight tissue types of colorectal cancer tissue slides: background (empty), adipose tissue, mucosal glands, debris, immune cells (lymphoma), complex stroma, simple stroma, and tumor epithelium. Some sample patches of the CRC tissue types can be seen in Fig. 4. We split the data into train/test sets with 60%/40% portions. Using the training set, we extracted 22,528 triplets by considering the tissue types as the classes.
TCGA dataset – The second histopathology dataset is The Cancer Genome Atlas (TCGA) dataset [2]. TCGA Whole Slide Images (WSIs) come from 25 different organs for 32 different cancer subtypes. We use the three most common sites, which are the prostate, gastrointestinal tract, and lung [2, 16]. These organs have a total of nine cancer subtypes: prostate adenocarcinoma (PRAD), testicular germ cell tumors (TGCT), oesophageal carcinoma (ESCA), stomach adenocarcinoma (STAD), colonic adenocarcinoma (COAD), rectal adenocarcinoma (READ), lung adenocarcinoma (LUAD), lung squamous cell carcinoma (LUSC), and mesothelioma (MESO). By sampling patches from the slides, we extracted 22,528 triplets to test the proposed losses with a large triplet sample size. The anchor and neighbor patches were selected from one WSI, but we used four ways of extracting the distant patch: from the same WSI but far from the anchor, from another WSI of the same cancer subtype as the anchor, from another cancer subtype but the same anatomic site as the anchor, and from another anatomic site.
IV-B Visual Comparison of Embeddings
In our experiments, we used ResNet-18 [14] as the backbone in our Siamese network structure (see Fig. 2) and set the dimensionality of the feature space to p = 128. The margin m, the diagonal-strengthening parameters μ_1 and μ_2, and the learning rate were fixed across all experiments.
Embedding of MNIST data – The embeddings of the train/test sets of the MNIST dataset in the feature spaces of the different loss functions are illustrated in Fig. 3, where the same value of λ was used for both FDT and FDC. We used Uniform Manifold Approximation and Projection (UMAP) [21] to visualize the 128-dimensional embedded data. As can be seen, both the train and test embeddings produced by the FDT loss are much more discriminating than those of the triplet loss. On the other hand, comparing the embeddings of the contrastive and FDC losses shows that both perform well, as the classes are well separated. Interestingly, similar digits are usually embedded as close classes in the feature space, which shows the meaningfulness of the trained subspace. For example, the digit pairs (3, 8), (1, 7), and (4, 9) (with the second writing style of the digit four) can transition into each other by slight changes, which is why they are embedded close together.
Embedding of histopathology data – For embedding the histopathology data, we performed two different experiments. In the first experiment, we trained and tested the Siamese network using the CRC data. The second experiment was to train the Siamese network using TCGA data and test it using the CRC test set. The latter, which we denote by TCGA→CRC, is more difficult because it tests the generalization of the feature space, which is trained on data different from the test data, although with a similar texture. Figure 4 shows the embeddings of the CRC test sets in the feature spaces trained by the CRC and TCGA data. The embeddings by all losses, including FDT and FDC, are acceptable, noting that histopathology data are hard to discriminate even for a human (see the sample patches in Fig. 4). As expected, the empty and adipose patches, which are similar, are embedded closely. Comparing the TCGA→CRC embeddings of the contrastive and FDC losses shows that FDC has discriminated the classes slightly better. Overall, the good embedding in TCGA→CRC shows that the proposed losses can train a generalizing feature space, which is very important in histopathology analysis because of the lack of labeled data [15].
IV-C Numerical Comparison of Embeddings
In addition to the visualizations, we can assess the embeddings numerically. For the evaluation of the embedded subspaces, we used the 1-Nearest-Neighbor (1-NN) search, because it evaluates a subspace by the closeness of the projected data samples. The accuracy rates of the 1-NN search for the embedded test data under the different loss functions are reported in Table I. We report the results for different values of λ in order to analyze the effect of this hyperparameter. As the results show, in most cases, the FDT and FDC losses outperform the triplet and contrastive losses, respectively. Moreover, we see that a moderate value of λ often performs best. This can be because a large λ (i.e., a small ε) imposes less penalty on the total scatter, which may cause the embedding space to expand gradually. A very small λ, on the other hand, puts too much emphasis on the total scatter, so the classes do not tend to separate well, since separating them would increase the total scatter.
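The 1-NN evaluation is straightforward in the embedding space; a NumPy sketch with our own naming:

```python
import numpy as np

def one_nn_accuracy(train_X, train_y, test_X, test_y):
    """Assign each test embedding the label of its nearest (Euclidean)
    training embedding and return the resulting accuracy."""
    # pairwise squared distances, shape (n_test, n_train)
    d2 = ((test_X[:, None, :] - train_X[None, :, :]) ** 2).sum(axis=-1)
    pred = train_y[d2.argmin(axis=1)]
    return float((pred == test_y).mean())
```

A tight, well-separated embedding yields a high 1-NN accuracy because each test sample falls near training samples of its own class.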
Table I – Accuracy of the 1-NN search in the feature spaces trained with the different losses; λ₁, λ₂, and λ₃ denote the three tested values of λ.

loss         | MNIST  | CRC    | TCGA→CRC
-------------|--------|--------|----------
triplet      | 82.21% | 95.75% | 95.50%
FDT (λ₁)     | 82.76% | 96.45% | 97.60%
FDT (λ₂)     | 85.74% | 96.05% | 96.40%
FDT (λ₃)     | 79.59% | 95.35% | 95.95%
contrastive  | 89.99% | 95.55% | 96.55%
FDC (λ₁)     | 78.47% | 94.25% | 96.55%
FDC (λ₂)     | 89.00% | 96.40% | 98.10%
FDC (λ₃)     | 87.71% | 97.00% | 97.05%
IV-D Comparison of the Latent and Feature Spaces
As explained in Section III-A, the last layer acts as a linear projection of the latent space onto the feature space. This projection fine-tunes the embeddings for better discrimination of the classes. Figure 5 shows the latent embedding of the MNIST training set for both the FDT and FDC loss functions. Comparing them to the feature embeddings of the MNIST training set in Fig. 3 shows that the feature embedding discriminates the classes much better than the latent embedding.
V Conclusions
In this paper, we proposed two novel loss functions for training Siamese networks. These losses are based on the theory of FDA, which attempts to decrease the intra-class scatter while increasing the inter-class scatter of the projected data. The FDT and FDC losses make use of triplets and pairs of samples, respectively. Through experiments on MNIST and two histopathology datasets, we showed that the proposed losses mostly perform better than the well-known triplet and contrastive loss functions for Siamese networks.
References
[1] (2004) Convex Optimization. Cambridge University Press.
[2] (2018) Pan-cancer insights from The Cancer Genome Atlas: the pathologist's perspective. The Journal of Pathology 244(5), pp. 512–524.
[3] (2019) Deep least squares Fisher discriminant analysis. IEEE Transactions on Neural Networks and Learning Systems.
[4] (2017) Deep Fisher discriminant analysis. In International Work-Conference on Artificial Neural Networks, pp. 501–512.
[5] (1936) The use of multiple measurements in taxonomic problems. Annals of Eugenics 7(2), pp. 179–188.
[6] (2009) The Elements of Statistical Learning. Vol. 2, Springer Series in Statistics, New York, NY, USA.
[7] (2013) Introduction to Statistical Pattern Recognition. Elsevier.
[8] (2019) Eigenvalue and generalized eigenvalue problems: tutorial. arXiv preprint arXiv:1903.11240.
[9] (2019) Fisher and kernel Fisher discriminant analysis: tutorial. arXiv preprint arXiv:1906.09436.
[10] (2019) Roweis discriminant analysis: a generalized subspace learning method. arXiv preprint arXiv:1910.05437.
[11] (2006) Metric learning by collapsing classes. In Advances in Neural Information Processing Systems, pp. 451–458.
[12] (2006) Dimensionality reduction by learning an invariant mapping. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Vol. 2, pp. 1735–1742.
[13] (2018) A twofold Siamese network for real-time object tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4834–4843.
[14] (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
[15] (2017) Analysis of histopathology images: from traditional machine learning to deep learning. In Biomedical Texture Analysis, pp. 281–314.
[16] (2020) Pan-cancer diagnostic consensus through searching archival histopathology images using artificial intelligence. npj Digital Medicine 3(1), pp. 1–15.
[17] (2016) Multi-class texture analysis in colorectal cancer histology. Scientific Reports 6, 27988.
[18] (2015) Siamese neural networks for one-shot image recognition. In ICML Deep Learning Workshop, Vol. 2.
[19] (2016) Learning local image descriptors with deep Siamese and triplet convolutional networks by minimising global loss functions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5385–5394.
[20] MNIST handwritten digits dataset. http://yann.lecun.com/exdb/mnist/, accessed 2019.
[21] (2018) UMAP: uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426.
[22] (1999) Fisher discriminant analysis with kernels. In Neural Networks for Signal Processing IX: Proceedings of the 1999 IEEE Signal Processing Society Workshop, pp. 41–48.
[23] (1998) The Symmetric Eigenvalue Problem. Classics in Applied Mathematics 20.
[24] (2015) FaceNet: a unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 815–823.
[25] (2006) Analysis on Fisher discriminant criterion and linear separability of feature space. In 2006 International Conference on Computational Intelligence and Security, Vol. 2, pp. 1671–1676.
[26] (2007) Least squares linear discriminant analysis. In Proceedings of the 24th International Conference on Machine Learning, pp. 1087–1093.
[27] (2019) Unsupervised embedding learning via invariant and spreading instance feature. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6210–6219.