Variants of Siamese networks contain several, typically two or three [25, 12], sub-networks that share their weights. Siamese topologies are robust networks for learning a discriminative embedding space, i.e., an explicit metric space, between the classes of data. One of these variants is the triplet network, in which triplets of anchor, positive, and negative instances are used to decrease the anchor-positive distance and to increase the anchor-negative distance, which in turn increases the inter-class variance and decreases the intra-class variance of the data. Two popular forms of loss function for training triplet networks are the triplet-loss and the softmax form. Examples of the latter are Neighborhood Component Analysis (NCA) and proxy-NCA.
Apart from the loss function, there is another degree of freedom: how the triplets are sampled. It has been shown that the sampling of triplets also matters in learning deep embeddings. Hence, proposing a sound sampling strategy is no less important than proposing a novel loss function. In other words, with triplet networks, drawing more informative and stable triplets from the pool of samples leads to qualitatively more salient embeddings.
Several triplet mining strategies already exist in the literature. Instead of using all the triplets in a mini-batch of data, i.e., Batch All (BA), one can mine the triplets as in Batch Semi-Hard (BSH) and Batch Hard (BH). Some mining methods, such as Easy Positive (EP), concentrate on the extreme distances of samples. Other triplet mining methods instead sample from the triplets available in a mini-batch of the data.
In this work, we aim to draw the positive and negative samples for every anchor instance in a dynamic manner. The main idea is to sample the positive and negative instances of triplets for every anchor in a mini-batch of data from some distributions rather than from the embedded data points themselves. This gives the triplet network more opportunity to explore the embedding space for increasing the inter-class variance and decreasing the intra-class variance, because the triplet information is not restricted to the embedded data but is instead stochastic. Related work on triplet sampling draws the triplets from the existing embedded data instances and therefore does not use the stochastic information of the embedding space. We assume a multivariate normal distribution for the embedded data instances of every class. These distributions are updated dynamically as new streaming embedded data are received for the different classes. For this dynamic updating, we leverage the theory of Bayesian distribution updating [13, 4] and conjugate priors [23, 14]. Sampling from dynamic distributions makes the task of sampling not only more robust to outliers but also more amenable to available data. The proposed approaches are called Bayesian Updating for triplet-loss (BUT) and Bayesian Updating for NCA loss (BUNCA).
The rest of the paper is organized as follows: Section II introduces the necessary background on Bayesian updating and conjugate priors. The dynamic triplet sampling for training triplet networks is proposed in Section III. We report and discuss the experimental results in Section IV. Finally, Section V concludes the paper and highlights the possible future work.
II Background on Bayesian Updating
In this section, we describe Bayesian updating and conjugate priors. We also briefly review the relevant distributions to lay the foundation for the dynamic triplet sampling of our approach.
II-A Bayesian Updating
Let $X$ and $\theta$ be two random variables where $\theta$ is a parameter of the distribution of $X$. According to Bayes' rule, we have
$$f(\theta \mid X) = \frac{f(X \mid \theta)\, f(\theta)}{f(X)}, \qquad (1)$$
which shows the relation of the posterior $f(\theta \mid X)$, likelihood $f(X \mid \theta)$, and prior $f(\theta)$. Given some data $X$ and the prior over the parameter of interest $\theta$, we want to find the posterior using Eq. (1). This is the basic idea behind Bayesian updating, in which the posterior over the parameter of interest is updated after receiving some new data, i.e., the posterior obtained from the existing data serves as the prior when the new data arrive.
II-B Conjugate Priors
If the posterior distribution and the prior distribution are in the same probability distribution family, they are called conjugate distributions, and the prior is the conjugate prior for the likelihood.
Assume there already exist some data, denoted by , and some new data, , are received. The existing data has a distribution with some parameter(s) . The posterior of the parameter of interest, i.e., , can be updated using the new data. Hence, this can be used to update the parameter(s) of the distribution of using the newly received data .
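As a minimal illustration of this idea, the following sketch uses the simplest conjugate pair, a univariate normal likelihood with known variance and a normal prior on its mean. This one-dimensional setting, the function name `update_normal_mean`, and all numerical values are assumptions made for the example and are not part of the proposed method. The sketch shows that updating with the existing data and then with the new data gives the same posterior as a single update with all the data.

```python
import numpy as np


def update_normal_mean(prior_mean, prior_var, data, noise_var):
    """Posterior N(post_mean, post_var) over the mean of a 1-D Gaussian.

    Assumes a known observation variance `noise_var` and a normal prior
    N(prior_mean, prior_var) on the unknown mean (conjugate pair).
    """
    data = np.asarray(data, dtype=float)
    prior_precision = 1.0 / prior_var
    data_precision = data.size / noise_var
    post_var = 1.0 / (prior_precision + data_precision)
    post_mean = post_var * (prior_precision * prior_mean + data.sum() / noise_var)
    return post_mean, post_var


rng = np.random.default_rng(0)
existing = rng.normal(loc=2.0, scale=1.0, size=50)  # existing data
new = rng.normal(loc=2.0, scale=1.0, size=20)       # newly received data

# Sequential updating: the first posterior becomes the prior of the second update.
m_first, v_first = update_normal_mean(0.0, 10.0, existing, 1.0)
m_seq, v_seq = update_normal_mean(m_first, v_first, new, 1.0)

# Single update with all the data at once.
m_all, v_all = update_normal_mean(0.0, 10.0, np.concatenate([existing, new]), 1.0)

print(m_seq, m_all)  # the two posterior means agree up to rounding
```

The same principle carries over to the multivariate case with unknown mean and covariance, where the conjugate prior is no longer a normal distribution.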
Let the data have a multivariate normal (or Gaussian) distribution, so that the likelihood is multivariate normal. Assume that both the mean and the covariance of the likelihood are treated as random variables, so the parameter of interest comprises the mean and the covariance. Using the new data, we want to update these parameters of the normal distribution. To update the posterior in this case, we should use the conjugate prior for the multivariate normal likelihood. The conjugate prior for the multivariate normal distribution with both random mean and covariance is the normal-inverse-Wishart distribution. In our analysis, we also require the skewed generalized Student-t distribution.
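II-C Relevant Distributions
Multivariate Normal Distribution: The Probability Density Function (PDF) of the multivariate normal distribution is defined as
$$\mathcal{N}(x; \mu, \Sigma) := \frac{1}{\sqrt{(2\pi)^d\, |\Sigma|}} \exp\Big(-\frac{1}{2} (x - \mu)^\top \Sigma^{-1} (x - \mu)\Big),$$
where $d$ is the dimensionality of data, $|\cdot|$ denotes the determinant of matrix, and $x \in \mathbb{R}^d$, $\mu \in \mathbb{R}^d$, and $\Sigma \in \mathbb{R}^{d \times d}$ are the data, mean, and covariance of data, respectively. The mean and covariance of the normal distribution can be estimated by the sample mean and sample covariance matrix, respectively.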
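The following sketch shows how the two estimators and the log-density of the multivariate normal distribution can be computed with NumPy. The function names are hypothetical, and the sample covariance is normalized by the sample size minus one, which is an assumption; the maximum likelihood estimate divides by the sample size instead.

```python
import numpy as np


def fit_gaussian(samples):
    """Sample mean and (unbiased) sample covariance of the rows of `samples`."""
    samples = np.asarray(samples, dtype=float)
    mean = samples.mean(axis=0)
    centered = samples - mean
    cov = centered.T @ centered / (samples.shape[0] - 1)
    return mean, cov


def gaussian_logpdf(x, mean, cov):
    """Log of the multivariate normal PDF evaluated at a single point x."""
    d = mean.shape[0]
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)              # log-determinant of the covariance
    mahalanobis = diff @ np.linalg.solve(cov, diff)  # (x - mu)^T Sigma^{-1} (x - mu)
    return -0.5 * (d * np.log(2.0 * np.pi) + logdet + mahalanobis)


rng = np.random.default_rng(0)
data = rng.normal(size=(500, 3))
mean_hat, cov_hat = fit_gaussian(data)
print(gaussian_logpdf(np.zeros(3), mean_hat, cov_hat))
```

Working with the log-density and `np.linalg.solve` avoids forming the inverse covariance explicitly, which is usually the numerically safer choice.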
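Wishart and Inverse Wishart Distributions: The PDF of the Wishart distribution over a $d \times d$ positive definite matrix $X$ is defined as
$$\mathcal{W}(X; V, \nu) := \frac{1}{2^{\nu d/2}\, |V|^{\nu/2}\, \Gamma_d(\nu/2)}\, |X|^{(\nu - d - 1)/2} \exp\Big(-\frac{1}{2}\, \mathrm{tr}(V^{-1} X)\Big),$$
where $\nu$ is the degrees of freedom (which should be $\nu > d - 1$), $V$ is the scale matrix, $\mathrm{tr}(\cdot)$ denotes the trace of matrix, and $\Gamma_d(\cdot)$ is the multivariate gamma function:
$$\Gamma_d(a) := \pi^{d(d-1)/4} \prod_{j=1}^{d} \Gamma\Big(a + \frac{1 - j}{2}\Big).$$
Consider a variable $X$ with Wishart distribution, i.e., $X \sim \mathcal{W}(V, \nu)$. Then, the variable $Y = X^{-1}$ has the inverse Wishart distribution whose PDF is defined as:
$$\mathcal{W}^{-1}(Y; \Psi, \nu) := \frac{|\Psi|^{\nu/2}}{2^{\nu d/2}\, \Gamma_d(\nu/2)}\, |Y|^{-(\nu + d + 1)/2} \exp\Big(-\frac{1}{2}\, \mathrm{tr}(\Psi Y^{-1})\Big),$$
where $\Psi$ is the scale matrix and we have $\Psi = V^{-1}$. From the moments of the inverse Wishart distribution, the mean of a random variable $Y \sim \mathcal{W}^{-1}(\Psi, \nu)$ is defined as follows:
$$\mathbb{E}(Y) = \frac{\Psi}{\nu - d - 1}, \quad \nu > d + 1.$$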
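The mean of the inverse Wishart distribution can be checked numerically. The sketch below assumes the standard result that the mean is the scale matrix divided by the degrees of freedom minus the dimensionality minus one, and compares it with a Monte Carlo estimate. The sampler builds Wishart matrices as sums of outer products of Gaussian vectors and inverts them; the function name, the chosen scale matrix, and the number of draws are illustrative assumptions.

```python
import numpy as np


def sample_inverse_wishart(psi, dof, num_draws, rng):
    """Draw inverse-Wishart matrices with scale `psi` and integer `dof`.

    A Wishart matrix with scale V = psi^{-1} is the sum of `dof` outer
    products of N(0, V) vectors; its inverse is inverse-Wishart(psi, dof).
    """
    d = psi.shape[0]
    wishart_scale = np.linalg.inv(psi)
    chol = np.linalg.cholesky(wishart_scale)
    gauss = rng.standard_normal((num_draws, dof, d)) @ chol.T  # rows ~ N(0, V)
    wishart = np.einsum("nki,nkj->nij", gauss, gauss)          # sum of outer products
    return np.linalg.inv(wishart)


rng = np.random.default_rng(0)
psi = np.array([[2.0, 0.5], [0.5, 1.0]])
dof = 10

draws = sample_inverse_wishart(psi, dof, 20000, rng)
empirical_mean = draws.mean(axis=0)
closed_form_mean = psi / (dof - psi.shape[0] - 1)

print(empirical_mean)
print(closed_form_mean)
```

The two printed matrices should agree up to Monte Carlo error, which shrinks as the number of draws grows.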
Skewed Generalized Student-t Distribution: The PDF of the Student-t distribution is defined as
where and are the mean and variance, respectively. The generalized Student-t distribution can be -dimensional multivariate [24, Definition 2]:
where and are the mean and covariance, respectively. The mean of the skewed generalized Student-t distribution is .
Normal-Inverse-Wishart Distribution: As was mentioned before, the conjugate prior for the multivariate normal distribution with both mean and covariance as random variables is the normal-inverse-Wishart distribution. Recall that we have some existing data, a set of data vectors of a given sample size, which we assume to have a multivariate normal distribution. We use the sample means of the existing and new data and, likewise, the sample covariance matrices over the existing and new data. The joint distribution of the mean and covariance is the Normal-Inverse-Wishart (NIW) distribution [4, 23]:
where and are the sample sizes of new data used for calculating the new mean and covariance matrix. In this work, we have .
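For reference, the standard conjugate update of the NIW distribution can be written as below. The notation is an assumption introduced for this sketch and may differ from the symbols used elsewhere in the paper: the prior is $\mathrm{NIW}(\mu_0, \kappa_0, \Psi_0, \nu_0)$, the new data $x_1, \dots, x_n \in \mathbb{R}^d$ have sample mean $\bar{x}$, and $S$ is their scatter matrix.

```latex
\begin{align*}
S &= \sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})^\top, \\
\kappa_n &= \kappa_0 + n, \qquad \nu_n = \nu_0 + n, \\
\mu_n &= \frac{\kappa_0\, \mu_0 + n\, \bar{x}}{\kappa_0 + n}, \\
\Psi_n &= \Psi_0 + S + \frac{\kappa_0\, n}{\kappa_0 + n}\,
          (\bar{x} - \mu_0)(\bar{x} - \mu_0)^\top, \\
\mathbb{E}[\mu \mid x_{1:n}] &= \mu_n, \qquad
\mathbb{E}[\Sigma \mid x_{1:n}] = \frac{\Psi_n}{\nu_n - d - 1}
\quad (\nu_n > d + 1).
\end{align*}
```

Under this parameterization, the marginal posterior of the mean is a multivariate Student-t distribution centered at $\mu_n$, and the marginal posterior of the covariance is an inverse Wishart distribution with scale $\Psi_n$ and $\nu_n$ degrees of freedom.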
III Dynamic Triplet Sampling for Training Triplet Networks
III-A Preliminaries and Notations
Consider a -dimensional training dataset where . The class labels of instances are . Suppose we have number of classes in the dataset. We use mini-batch (of size ) stochastic gradient descent for training the network. Let denote the training sample size per class in a mini-batch. We show the -th training instance of the -th class in a mini-batch by . Let denote the embedding of by the triplet network where the dimensionality of embedding space is .
The data for each class are accumulated by receiving new mini-batches of data. Let denote the sample size of accumulated data for the -th class so far. The sample size per -th class in a mini-batch is denoted by . In this work, we have and because we take the same sample size per class in the mini-batch. This is the sample size of new incoming data per class in every mini-batch. The accumulated data for the -th class so far are denoted by . Also, and are the mean and covariance of the distribution of the -th class, respectively.
III-B Sampling Algorithm
We assume a multivariate normal distribution for the embedded data of every class. We motivate this assumption by the central limit theorem and by the fact that the normal distribution is the most common continuous distribution. In the first batch, where no embeddings of the training data exist yet, we use Maximum Likelihood Estimation (MLE) to estimate the distribution parameters. The mean and covariance of the embedded data of every class are estimated by the sample mean and covariance matrix, respectively.
In subsequent batches, we do have some existing data per class, denoted by . According to Bayesian updating, the mean and covariance of the distribution of every class are updated by Eqs. (14) and (15), respectively. Specifically, we update the mean and covariance matrix of every class by the expectations of Eqs. (14) and (15), which are the generalized Student-t and the inverse Wishart distributions, respectively. According to the expectations of these two distributions, which were introduced in Section II, the updates of the mean and covariance of the -th class can be given as
where, in Eq. (13), we use and calculate , , , and by the sample mean and sample covariance matrix using the new batch of data. Note that for , which occurs in the very first mini-batches of the first epoch, we update the covariance matrix by MLE.
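A minimal sketch of such a per-class update is given below. It is not the exact form of the update equations of this section but the standard NIW posterior expectation under three explicitly stated assumptions: the prior pseudo-counts equal the accumulated sample size `n_prior`, the prior scale matrix is `n_prior` times the current covariance estimate, and the MLE fallback is used whenever the inverse Wishart mean is undefined (degrees of freedom not larger than the dimensionality plus one). The function name `bayesian_update` is hypothetical.

```python
import numpy as np


def bayesian_update(prior_mean, prior_cov, n_prior, new_data):
    """Update the mean and covariance of one class with a new batch.

    Returns the expected mean, the expected covariance, and the new
    accumulated sample size. Assumes an NIW prior with kappa_0 = nu_0 =
    n_prior and scale matrix n_prior * prior_cov (illustrative choices).
    """
    new_data = np.asarray(new_data, dtype=float)
    n_new, d = new_data.shape
    new_mean = new_data.mean(axis=0)
    centered = new_data - new_mean
    scatter = centered.T @ centered

    kappa_n = n_prior + n_new
    nu_n = n_prior + n_new
    post_mean = (n_prior * prior_mean + n_new * new_mean) / kappa_n

    if nu_n <= d + 1:
        # The inverse-Wishart mean is undefined here, so fall back to the MLE.
        post_cov = scatter / n_new
    else:
        shift = (new_mean - prior_mean)[:, None]
        scale = (n_prior * prior_cov + scatter
                 + (n_prior * n_new / kappa_n) * (shift @ shift.T))
        post_cov = scale / (nu_n - d - 1)
    return post_mean, post_cov, kappa_n


rng = np.random.default_rng(0)
batch = rng.normal(size=(50, 3)) + 1.0
mean, cov, n_total = bayesian_update(np.zeros(3), np.eye(3), 100, batch)
small_mean, small_cov, _ = bayesian_update(np.zeros(3), np.eye(3), 0, batch[:3])
print(mean, n_total)
```

Calling the function once per class and per mini-batch, and feeding its outputs back in as the next prior, yields a running estimate that weights the existing and new data by their sample sizes.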
The proposed dynamic triplet sampling is summarized in Algorithm 1. The mean and covariance of every class are estimated by MLE at the initial batch. In the following batches, Bayesian updating is exploited for updating the mean and covariance of the classes. After the means and covariances are updated, we sample the triplets. For every instance of a batch, considered as an "anchor", a negative instance is sampled from the distribution of each other class, resulting in one negative per other class for every anchor. Accordingly, the same number of positive instances is sampled from the distribution of the class of the anchor. Overall, triplets are sampled in every mini-batch while the distributions of the classes are being updated dynamically.
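The sampling step can be sketched as follows, assuming that the class means and covariances have already been estimated or updated. The function name, the array layout, and the pairing of one freshly drawn positive with each drawn negative are assumptions made for this illustration.

```python
import numpy as np


def sample_triplets(anchors, anchor_labels, class_means, class_covs, rng):
    """Draw triplets from per-class Gaussians.

    anchors: (b, p) embedded instances; anchor_labels: (b,) class indices;
    class_means: (c, p); class_covs: (c, p, p).
    Returns three arrays of shape (b * (c - 1), p).
    """
    num_classes = class_means.shape[0]
    out_anchor, out_positive, out_negative = [], [], []
    for anchor, label in zip(anchors, anchor_labels):
        for other in range(num_classes):
            if other == label:
                continue
            # Positive from the anchor's own class distribution.
            positive = rng.multivariate_normal(class_means[label], class_covs[label])
            # Negative from the distribution of a different class.
            negative = rng.multivariate_normal(class_means[other], class_covs[other])
            out_anchor.append(anchor)
            out_positive.append(positive)
            out_negative.append(negative)
    return np.stack(out_anchor), np.stack(out_positive), np.stack(out_negative)


rng = np.random.default_rng(0)
means = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
covs = np.stack([1e-6 * np.eye(2)] * 3)
labels = np.array([0, 0, 1, 1, 2, 2])
batch = means[labels] + 0.01 * rng.standard_normal((6, 2))

a, p, n = sample_triplets(batch, labels, means, covs, rng)
print(a.shape, p.shape, n.shape)
```

In a training loop, the sampled positives and negatives would be treated as constants, so that gradients flow only through the embedded anchors.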
III-C Optimization of the Loss Functions
In a mini-batch, let the anchor, positive, and negative instances be indexed by , , , respectively. Using the sampled triplets, the triplet-loss function can be employed to train the triplet network:
where denotes the standard Hinge loss and is a small margin (e.g., ). When dynamic triplet sampling is used with the triplet loss, we call this Bayesian Updating for triplet-loss (BUT).
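A minimal NumPy sketch of the triplet-loss over a set of sampled triplets is given below. The squared Euclidean distance, the summation over triplets, and the default margin value are assumptions made for the example.

```python
import numpy as np


def triplet_loss(anchors, positives, negatives, margin=0.25):
    """Hinge-based triplet loss summed over triplets.

    Each input has shape (num_triplets, p). The margin value is a
    hypothetical default chosen for illustration only.
    """
    dist_pos = np.sum((anchors - positives) ** 2, axis=1)  # anchor-positive
    dist_neg = np.sum((anchors - negatives) ** 2, axis=1)  # anchor-negative
    return np.sum(np.maximum(0.0, dist_pos - dist_neg + margin))


anchor = np.array([[0.0, 0.0]])
positive = np.array([[1.0, 0.0]])
far_negative = np.array([[3.0, 0.0]])
near_negative = np.array([[1.0, 0.0]])

print(triplet_loss(anchor, positive, far_negative))   # margin satisfied
print(triplet_loss(anchor, positive, near_negative))  # margin violated
```

A triplet contributes to the loss only when the negative is not farther from the anchor than the positive by at least the margin.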
As was mentioned before, the triplet-loss should increase the inter-class variance and decrease the intra-class variance to obtain a discriminative embedding space for the classes of data. This intuition can also be implemented in a softmax form, which is referred to as NCA. We can use this form to train the network:
When dynamic triplet sampling is used with the NCA loss function, we call the approach Bayesian Updating for NCA loss (BUNCA).
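One common softmax-form loss of this kind can be sketched as follows. The sketch assumes squared Euclidean distances and a denominator that contains the positive term together with all negative terms of an anchor; other NCA-style formulations place only the negatives in the denominator, so this is an illustrative variant rather than the exact loss used here.

```python
import numpy as np


def nca_loss(anchors, positives, negatives):
    """Softmax-form (NCA-style) loss summed over anchors.

    anchors, positives: (b, p); negatives: (b, k, p) with k negatives
    per anchor. Computes -log of the softmax weight of the positive.
    """
    dist_pos = np.sum((anchors - positives) ** 2, axis=1)                 # (b,)
    dist_neg = np.sum((anchors[:, None, :] - negatives) ** 2, axis=2)     # (b, k)
    logits = -np.concatenate([dist_pos[:, None], dist_neg], axis=1)       # (b, 1 + k)
    # Numerically stable log-sum-exp over the positive and the negatives.
    max_logit = logits.max(axis=1, keepdims=True)
    log_norm = max_logit[:, 0] + np.log(np.exp(logits - max_logit).sum(axis=1))
    return np.sum(log_norm - logits[:, 0])


anchor = np.array([[0.0, 0.0]])
positive = np.array([[0.0, 0.0]])
tied_negative = np.array([[[0.0, 0.0]]])
far_negative = np.array([[[10.0, 0.0]]])

print(nca_loss(anchor, positive, tied_negative))
print(nca_loss(anchor, positive, far_negative))
```

The loss approaches zero when every negative is much farther from the anchor than the positive, and it grows as negatives move closer.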
We used two different datasets in our experiments. The first dataset is the MNIST digits data with 60,000 training instances and 10,000 test instances. The second dataset is the large colorectal cancer (CRC) histopathology dataset [16, 17] with 100,000 stain-normalized image patches. The large CRC dataset includes nine classes of tissues, namely adipose, background, debris, lymphocytes, mucus, smooth muscle, normal colon mucosa (normal), cancer-associated stroma, and colorectal adenocarcinoma epithelium (tumor). Note that the literature has shown the effectiveness of variants of triplet networks for histopathology data, both with the triplet-loss and with the NCA loss; this motivates validating our approaches on this domain.
IV-B Experimental Setup
For the MNIST dataset, we split the training data into and portions for training and validation sets. The test set with 10,000 images was used for testing. The CRC data were split into training, validation, and test sets with , , and portions, respectively. We used the ResNet-18 network as the backbone of the triplet network. Using the validation set, early stopping was employed, and the maximum number of epochs was set to . The batch size was 50 and 45 for the MNIST and CRC data, respectively, because every batch contains five instances per class and the two datasets have ten and nine classes. The learning rate was set to , and the dimensionality of the embedding space was .
IV-C Visualization of Embedding Spaces
The 2D visualization of the embedding spaces was performed using the Uniform Manifold Approximation and Projection (UMAP) applied to the embedded data. Figure 1 illustrates the embedding of the test sets of the MNIST and CRC data using the BUT and BUNCA sampling methods. As is apparent in this figure, the learned embedding spaces are interpretable. In the embeddings of the MNIST data, digits that are similar in their style of writing fall close to one another. Closely embedded digits by BUT (see Fig. 1-a) are the digits 1 and 7, 7 and 9, 3 and 8, and 4 (second style of writing) and 9. Likewise, closely embedded digits by BUNCA (see Fig. 1-b) are the digits 0 and 6, 1 and 7, 7 and 9, 3 and 8, and 2 and 3 (because continuing the bottom curve of 2 results in 3).
The embedding spaces for the histopathology data are also meaningful. As expected, histopathology patches with similar patterns are embedded close to each other. In the embedding obtained by the BUT approach (see Fig. 1-c), the patches are arranged from the smoothest to the roughest patterns in a circular manner: adipose (with thin stripes of fat), mucus, smooth muscle, debris, stroma, tumor, normal, and lymphocyte (with a rough pattern). Moreover, the background patches, which have no pattern, are separated from the tissues, as expected. In the embedding obtained by the BUNCA approach (see Fig. 1-d), patches with a comparable amount of roughness are embedded closely. For example, adipose, mucus, stroma, and smooth muscle, which are smoother, fall close to each other, while tumor, normal, lymphocyte, and debris, with diverse patterns, are embedded close to each other. Again, the background patches are embedded far from the tissue types. The meaningfulness of the learned embedding spaces supports the effectiveness of the proposed BUT and BUNCA approaches.
IV-D Query Retrieval
For the evaluation of the embedding space, one can see the embedded instances as a database where nearby cases can be retrieved as matched cases for a query instance. The retrievals are extracted using the nearest neighbors in the embedding space. Because of representation learning, the retrievals are expected to be similar to the query in terms of pattern. In Fig. 2, we illustrate the top ten retrievals for query examples for both MNIST and histopathology data. The retrievals in the embedding spaces using both BUT and BUNCA approaches are shown to visually verify the similarity matching.
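Nearest-neighbor retrieval in the embedding space can be sketched as follows. The Euclidean distance and the function name `retrieve_top_k` are assumptions; the embeddings are small hand-made vectors used only for illustration.

```python
import numpy as np


def retrieve_top_k(query, database, k=10):
    """Indices and distances of the k database embeddings nearest to the query."""
    distances = np.linalg.norm(database - query, axis=1)
    order = np.argsort(distances, kind="stable")[:k]
    return order, distances[order]


database = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]])
query = np.array([0.1, 0.0])

indices, dists = retrieve_top_k(query, database, k=2)
print(indices, dists)
```

The returned indices are ordered from the closest to the farthest retrieval, which matches the ordering used when the top retrievals are displayed.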
IV-D1 Retrieval of Digit Images
In Fig. 2, the retrievals for a digit 4 with the second style of writing are depicted. As expected, the retrievals are very similar to the pattern of the query image, and the first retrievals are more similar to the query than the last ones. For this query example in the BUNCA approach, one of the retrievals is wrong, but it is interpretable: the second writing style of the digit "4" is very similar to the digit "9" and can be morphed into it by a slight change.
IV-D2 Retrieval of Histopathology Patches
Query retrieval can be very useful for histopathology data in hospitals, where similar patches are retrieved from the database in order to rely on already diagnosed cases. The type of disease or tissue can be determined by a majority vote amongst the retrievals. Fig. 2 shows retrievals for two different tissue types, namely tumor and mucus. The former has more complex patterns than the latter. As the figure shows, the retrievals are very similar to the pattern of the query patch.
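The majority vote amongst the retrievals can be sketched as below, assuming Euclidean nearest neighbors and a database of labeled embeddings. The function name, the toy embeddings, and the value of `k` are hypothetical.

```python
from collections import Counter

import numpy as np


def predict_by_majority_vote(query, database, database_labels, k=10):
    """Label of a query decided by a majority vote among its k nearest retrievals."""
    distances = np.linalg.norm(database - query, axis=1)
    nearest = np.argsort(distances, kind="stable")[:k]
    votes = Counter(database_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


database = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [5.1, 5.0]])
database_labels = ["tumor", "tumor", "tumor", "mucus", "mucus"]

print(predict_by_majority_vote(np.array([0.05, 0.05]), database, database_labels, k=3))
print(predict_by_majority_vote(np.array([5.0, 5.05]), database, database_labels, k=3))
```

Using an odd `k` reduces the chance of ties in two-class settings; with more classes, a tie-breaking rule such as preferring the closest retrieval would be needed.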
IV-E Comparison with Baseline Methods
In Tables I and II, we compare the proposed BUT and BUNCA approaches with the existing triplet mining methods in the literature. These tables report the Recall@k (R@k) metric on the embedded test data for different values of k. The baseline approaches that we compare with are BA, BSH, BH, EP, DWS, NCA, and proxy-NCA; these methods were briefly introduced in Section I. Among these methods, DWS is a sampling method that samples from the existing instances in the mini-batch, in contrast to our proposed approach, which samples from the distribution of data.
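The Recall@k metric can be sketched as follows, assuming the definition that is common in the metric learning literature: a query counts as a hit if at least one of its k nearest neighbors, excluding the query itself, has the same class label. The function name and the toy data are illustrative.

```python
import numpy as np


def recall_at_k(embeddings, labels, ks=(1, 2, 4, 8)):
    """Recall@k over all instances, each used as a query against the others."""
    embeddings = np.asarray(embeddings, dtype=float)
    labels = np.asarray(labels)
    sq_norms = np.sum(embeddings ** 2, axis=1)
    # Pairwise squared Euclidean distances.
    dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * embeddings @ embeddings.T
    np.fill_diagonal(dists, np.inf)  # a query must not retrieve itself
    order = np.argsort(dists, axis=1)
    results = {}
    for k in ks:
        hits = (labels[order[:, :k]] == labels[:, None]).any(axis=1)
        results[k] = float(hits.mean())
    return results


separated = recall_at_k([[0.0], [0.1], [5.0], [5.1]], [0, 0, 1, 1], ks=(1,))
mixed = recall_at_k([[0.0], [0.1], [0.2]], [0, 1, 0], ks=(1, 2))
print(separated, mixed)
```

In the second toy example, the middle point has no other instance of its class, so it can never be a hit, which bounds the recall at two thirds.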
Table I reports the results for the MNIST dataset. The proposed BUT approach outperforms all other methods. Moreover, BUNCA performs better than EP and DWS, where DWS is also a sampling approach for triplet mining. The results for the CRC histopathology data are reported in Table II. On this data, the performance of BUNCA is closer to BUT. In most cases, BUT has the best performance against all the baseline approaches. On this dataset, BUNCA performs better than BA, BSH, EP, DWS, NCA, and is comparable with proxy-NCA. Overall, these two tables demonstrate the effectiveness of the proposed mining approaches for triplet training.
V Conclusions and Future Direction
Different triplet mining approaches have been proposed since the introduction of triplet networks. In this paper, we proposed a triplet mining method which considers a multivariate normal distribution for the embedding of every class through sampling the triplets from these distributions rather than from the existing instances in the mini-batch. By Bayesian updating, the distributions are dynamically updated using the received stream of mini-batches. This approach makes use of the stochastic information of the embedding space, rather than being restricted to the existing instances, for better discrimination of classes. The proposed BUT and BUNCA approaches of the dynamic triplet sampling were validated by experiments on two public datasets and compared against baseline methods from literature. As a possible future work, one can explore a mixture of Gaussian distributions for every class of data using expectation maximization.
References
[1] Overfitting in neural nets: backpropagation, conjugate gradient, and early stopping. In Advances in Neural Information Processing Systems, pp. 402–408.
[2] Image set classification via template triplets and context-aware similarity embedding. In Asian Conference on Computer Vision, pp. 231–247.
[3] (2015) Deep feature learning with relative distance comparison for person re-identification. Pattern Recognition 48 (10), pp. 2993–3003.
[4] (2013) Bayesian data analysis. CRC Press.
[5] Fisher discriminant triplet and contrastive losses for training siamese networks. In 2020 International Joint Conference on Neural Networks (IJCNN).
[6] (2005) Neighbourhood components analysis. In Advances in Neural Information Processing Systems, pp. 513–520.
[7] (2018) Matrix variate distributions. Vol. 104, CRC Press.
[8] (2006) Dimensionality reduction by learning an invariant mapping. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Vol. 2, pp. 1735–1742.
[9] (2001) Central limit theorem. Encyclopedia of Mathematics, Springer.
[10] (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
[11] (2017) In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737.
[12] (2015) Deep metric learning using triplet network. In International Workshop on Similarity-Based Pattern Recognition, pp. 84–92.
[13] (1992) Bayesian updating and belief functions. IEEE Transactions on Systems, Man, and Cybernetics 22 (5), pp. 1144–1152.
[14] (2010) The conjugate prior for the normal distribution. Technical report, University of California, Berkeley.
[15] Pan-cancer diagnostic consensus through searching archival histopathology images using artificial intelligence. NPJ Digital Medicine 3 (1), pp. 1–15.
[16]
[17] Predicting survival from colorectal cancer histology slides using deep learning: a retrospective multicenter study. PLoS Medicine 16 (1).
[18] (2016) Multi-class texture analysis in colorectal cancer histology. Scientific Reports 6, pp. 27988.
[19] (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324.
[20] (1979) Multivariate analysis. Academic Press, London.
[21] (2018) UMAP: uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426.
[22] (2017) No fuss distance metric learning using proxies. In Proceedings of the IEEE International Conference on Computer Vision, pp. 360–368.
[23] (2007) Conjugate Bayesian analysis of the Gaussian distribution. Technical report, University of British Columbia.
[24] (2013) A generalised Student's t-distribution. Statistics & Probability Letters 83 (1), pp. 70–77.
[25] FaceNet: a unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 815–823.
[26] (2020) Supervision and source domain impact on representation learning: a histopathology case study. In 2020 International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC).
[27] (2020) Learning with less data via weakly labeled patch classification in digital pathology. In 2020 IEEE 17th International Symposium on Biomedical Imaging (ISBI), pp. 471–475.
[28] (1998) Financial data and the skewed generalized t distribution. Management Science 44 (12-part-1), pp. 1650–1661.
[29] (1988) Moments for the inverted Wishart distribution. Scandinavian Journal of Statistics, pp. 97–109.
[30] (2017) Sampling matters in deep embedding learning. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2840–2848.
[31] (2020) Improved embeddings with easy positive triplet mining. In The IEEE Winter Conference on Applications of Computer Vision, pp. 2474–2482.
[32] (2019) Unsupervised embedding learning via invariant and spreading instance feature. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6210–6219.