Batch-Incremental Triplet Sampling for Training Triplet Networks Using Bayesian Updating Theorem

07/10/2020, by Milad Sikaroudi, et al., University of Waterloo

Variants of triplet networks are robust tools for learning a discriminative embedding subspace. There exist different triplet mining approaches for selecting the most suitable training triplets. Some of these mining methods rely on the extreme distances between instances, while others make use of sampling. However, sampling from stochastic distributions of data, rather than sampling merely from the existing embedding instances, can provide more discriminative information. In this work, we sample triplets from distributions of data rather than from existing instances. We consider a multivariate normal distribution for the embedding of each class. Using Bayesian updating and conjugate priors, we update the distributions of classes dynamically by receiving the new mini-batches of training data. The proposed triplet mining with Bayesian updating can be used with any triplet-based loss function, e.g., the triplet-loss or the Neighborhood Component Analysis (NCA) loss. Accordingly, our triplet mining approaches are called Bayesian Updating Triplet (BUT) and Bayesian Updating NCA (BUNCA), depending on which loss function is used. Experimental results on two public datasets, namely MNIST and the histopathology colorectal cancer (CRC) dataset, substantiate the effectiveness of the proposed triplet mining method.


I Introduction

Variants of Siamese networks contain several, typically two [8] or three [25, 12], sub-networks sharing their weights. Siamese topologies are robust networks for learning a discriminative embedding space, i.e., an explicit metric space, between the classes of data [2]. One of these variants is the triplet network, in which triplets of anchor, positive, and negative instances are used for decreasing the anchor-positive distances and increasing the anchor-negative distances [25], resulting in increasing the inter-class and decreasing the intra-class variances of data [5]. Two popular forms of loss function for training with triplets are the triplet-loss [25] and the softmax form [32]. Some examples of the latter are Neighborhood Component Analysis (NCA) [6] and proxy-NCA [22].

Apart from the loss function, there is another degree of freedom, which is how the triplets are sampled. It is shown in [30] that the sampling of triplets also matters in learning deep embeddings. Hence, proposing a decent sampling strategy is no less important than proposing a novel loss function. In other words, with triplet networks, drawing more informative and stable triplets from the pool of samples leads to qualitatively more salient embeddings.

There are already some triplet mining strategies in the literature. Instead of using all the triplets in a mini-batch of data, i.e., Batch All (BA) [3], one can mine the triplets as in Batch Semi-Hard (BSH) [25] and Batch Hard (BH) [11]. Some mining methods, such as Easy Positive (EP) [31], concentrate on the extreme distances of samples. However, some other triplet mining methods use the concept of sampling from the available triplets in a mini-batch of the data [30].

In this work, we aim to draw the positive and negative samples for every anchor instance in a dynamic manner. The main idea is to sample the positive and negative instances of the triplets for every anchor in a mini-batch from class distributions rather than from the embedded data points themselves. This gives the triplet network more opportunity to explore the embedding space for increasing the inter-class and decreasing the intra-class variances, because the triplet information is not restricted to the embedded data but is instead stochastic. In contrast, the related work on triplet sampling draws the triplets from the existing embedded data instances [30] and therefore does not use the stochastic information of the embedding space. We assume a multivariate normal distribution for the embedded data instances of every class. These distributions are updated dynamically by receiving new streaming embedded data for the different classes. For this dynamic updating, we leverage the theory of Bayesian distribution updating [13, 4] and conjugate priors [23, 14]. Sampling from dynamic distributions makes the sampling not only more robust to outliers but also more adaptive to the available data. The proposed approaches are called Bayesian Updating for triplet-loss (BUT) and Bayesian Updating for NCA loss (BUNCA).

The rest of the paper is organized as follows: Section II introduces the necessary background on Bayesian updating and conjugate priors. The dynamic triplet sampling for training triplet networks is proposed in Section III. We report and discuss the experimental results in Section IV. Finally, Section V concludes the paper and highlights the possible future work.

II Background on Bayesian Updating

In this section, we describe Bayesian updating and conjugate priors. We also briefly review the relevant distributions to lay the foundation for the dynamic triplet sampling of our approach.

II-A Bayesian Updating

Let $x$ and $\theta$ be two random variables where $\theta$ is a parameter of the distribution of $x$. According to Bayes' rule, we have

(1)  $P(\theta \mid x) = \dfrac{P(x \mid \theta)\, P(\theta)}{P(x)},$

which shows the relation of the posterior $P(\theta \mid x)$, the likelihood $P(x \mid \theta)$, and the prior $P(\theta)$. Given some data $x$ and the prior over the parameter of interest $\theta$, we want to find the posterior using Eq. (1). This is the basic idea behind Bayesian updating, in which the posterior over the parameter of interest is updated after receiving some new data, i.e., using the new data $x_{\text{new}}$, we have $P(\theta \mid x, x_{\text{new}}) \propto P(x_{\text{new}} \mid \theta)\, P(\theta \mid x)$ [13].
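To see why the posterior can be updated incrementally, note that if the new data are conditionally independent of the old data given $\theta$ (an assumption made here only for illustration), the previous posterior plays the role of the prior for the next update:

$P(\theta \mid x, x_{\text{new}}) \;\propto\; P(x_{\text{new}}, x \mid \theta)\, P(\theta) \;=\; P(x_{\text{new}} \mid \theta)\, P(x \mid \theta)\, P(\theta) \;\propto\; P(x_{\text{new}} \mid \theta)\, P(\theta \mid x).$

Hence, processing a stream of data only requires repeating the same update with the latest posterior used as the prior.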

II-B Conjugate Priors

If the posterior distribution $P(\theta \mid x)$ and the prior distribution $P(\theta)$ are in the same probability distribution family, they are called conjugate distributions, and the prior is called the conjugate prior for the likelihood $P(x \mid \theta)$ [4].

Assume there already exist some data, denoted by $X_{\text{old}}$, and some new data, $X_{\text{new}}$, are received. The existing data have a distribution with some parameter(s) $\theta$. The posterior of the parameter of interest, i.e., $P(\theta \mid X_{\text{new}})$, can be updated using the new data. Hence, conjugacy can be used to update the parameter(s) of the distribution of the data using the newly received data [14].

Let the data $x$ have a multivariate normal (or Gaussian) distribution, so its likelihood is $\mathcal{N}(x \mid \mu, \Sigma)$. Assume both the mean and covariance of the likelihood are considered as random variables, so $\theta$ includes the mean $\mu$ and the covariance $\Sigma$. Using the new data $X_{\text{new}}$, we want to update the parameters, mean and covariance, of the normal distribution. In this case, the likelihood has a multivariate normal distribution, and for updating the posterior, we should use the conjugate prior for this likelihood. The conjugate prior distribution for the multivariate normal distribution with both random mean and covariance is the normal-inverse-Wishart distribution [23]. In our analysis, we also require the skewed generalized Student-$t$ distribution.

II-C Relevant Distributions

Multivariate Normal Distribution: The Probability Density Function (PDF) of the multivariate normal distribution is defined as [4]

(2)  $\mathcal{N}(x \mid \mu, \Sigma) = \dfrac{1}{(2\pi)^{d/2}\, |\Sigma|^{1/2}} \exp\!\Big(\!-\dfrac{1}{2}(x - \mu)^{\top} \Sigma^{-1} (x - \mu)\Big),$

where $d$ is the dimensionality of data, $|\cdot|$ denotes the determinant of a matrix, and $x$, $\mu$, and $\Sigma$ are the data, mean, and covariance of data, respectively. The mean and covariance of the normal distribution can be estimated by the sample mean and sample covariance matrix, respectively.

Wishart and Inverse Wishart Distributions: The PDF of the Wishart distribution is defined as [4]

(3)  $\mathcal{W}(X \mid \Psi, \nu) = \dfrac{|X|^{(\nu - d - 1)/2} \exp\!\big(\!-\tfrac{1}{2}\mathrm{tr}(\Psi^{-1} X)\big)}{2^{\nu d/2}\, |\Psi|^{\nu/2}\, \Gamma_d(\tfrac{\nu}{2})},$

where $\nu$ is the degrees of freedom (which should be $\nu > d - 1$), $\Psi$ is the scale matrix, $\mathrm{tr}(\cdot)$ denotes the trace of a matrix, and $\Gamma_d(\cdot)$ is the multivariate gamma function [7]:

(4)  $\Gamma_d\Big(\dfrac{\nu}{2}\Big) = \pi^{d(d-1)/4} \prod_{j=1}^{d} \Gamma\Big(\dfrac{\nu}{2} - \dfrac{j-1}{2}\Big).$

Consider a variable with the Wishart distribution, i.e., $X \sim \mathcal{W}(\Psi, \nu)$. Then, the variable $Y = X^{-1}$ has the inverse Wishart distribution whose PDF is defined as [4]:

(5)  $\mathcal{W}^{-1}(Y \mid \Lambda, \nu) = \dfrac{|\Lambda|^{\nu/2}}{2^{\nu d/2}\, \Gamma_d(\tfrac{\nu}{2})}\, |Y|^{-(\nu + d + 1)/2} \exp\!\Big(\!-\dfrac{1}{2}\mathrm{tr}(\Lambda\, Y^{-1})\Big),$

where $\Lambda$ is the scale matrix and we have $\Lambda = \Psi^{-1}$ [20]. From the moments of the inverse Wishart distribution, the mean of a random variable $Y \sim \mathcal{W}^{-1}(\Lambda, \nu)$ is defined as follows [29]:

(6)  $\mathbb{E}[Y] = \dfrac{\Lambda}{\nu - d - 1}.$

Skewed Generalized Student-$t$ Distribution: The PDF of the Student-$t$ distribution is defined as [4]

(7)  $f(x \mid \nu) = \dfrac{\Gamma\big(\tfrac{\nu+1}{2}\big)}{\sqrt{\nu\pi}\, \Gamma\big(\tfrac{\nu}{2}\big)} \Big(1 + \dfrac{x^2}{\nu}\Big)^{-\frac{\nu+1}{2}},$

where $\nu$ is the degrees of freedom and $\Gamma(\cdot)$ is the gamma function. The Student-$t$ distribution can be generalized to the skewed generalized Student-$t$ distribution, whose PDF is defined as [23, 28]

(8)  $f(x \mid \mu, \sigma^2, \nu) = \dfrac{\Gamma\big(\tfrac{\nu+1}{2}\big)}{\sqrt{\nu\pi\sigma^2}\, \Gamma\big(\tfrac{\nu}{2}\big)} \Big(1 + \dfrac{(x - \mu)^2}{\nu\,\sigma^2}\Big)^{-\frac{\nu+1}{2}},$

where $\mu$ and $\sigma^2$ are the mean and variance, respectively. The generalized Student-$t$ distribution can be $d$-dimensional multivariate [24, Definition 2]:

(9)  $f(x \mid \mu, \Sigma, \nu) = \dfrac{\Gamma\big(\tfrac{\nu + d}{2}\big)}{(\nu\pi)^{d/2}\, \Gamma\big(\tfrac{\nu}{2}\big)\, |\Sigma|^{1/2}} \Big(1 + \dfrac{1}{\nu}(x - \mu)^{\top} \Sigma^{-1} (x - \mu)\Big)^{-\frac{\nu + d}{2}},$

where $\mu$ and $\Sigma$ are the mean and covariance, respectively. The mean of the skewed generalized Student-$t$ distribution is $\mu$ [23].

Normal-Inverse-Wishart Distribution: As was mentioned before, the conjugate prior distribution for the multivariate normal distribution with both mean and covariance as random variables is the normal-inverse-Wishart distribution. Recall that we have some existing data denoted by $X_{\text{old}}$. We show the set of existing data vectors by $X_{\text{old}} := \{x_i\}_{i=1}^{n_{\text{old}}}$, where $n_{\text{old}}$ is the sample size of the existing data. Assume that the data have a multivariate normal distribution $\mathcal{N}(\mu, \Sigma)$. Let $\bar{\mu}_{\text{old}}$ and $\bar{\mu}_{\text{new}}$ denote the sample mean of the existing and new data, respectively. Likewise, $\bar{\Sigma}_{\text{old}}$ and $\bar{\Sigma}_{\text{new}}$ are the sample covariance matrices over the existing and new data, respectively.

The prior of the covariance is $\Sigma \sim \mathcal{W}^{-1}(\Psi_0, \nu_0)$ and the distribution of the mean given the covariance is $\mu \mid \Sigma \sim \mathcal{N}(\mu_0, \tfrac{1}{\kappa_0}\Sigma)$ [4, 23]. The joint distribution of the mean and covariance is the Normal-Inverse-Wishart (NIW) distribution [4, 23]:

(10)  $\mathrm{NIW}(\mu, \Sigma \mid \mu_0, \kappa_0, \nu_0, \Psi_0) := \mathcal{N}\Big(\mu \,\Big|\, \mu_0, \dfrac{1}{\kappa_0}\Sigma\Big)\, \mathcal{W}^{-1}(\Sigma \mid \Psi_0, \nu_0),$

where $\kappa_0$ and $\nu_0$ can be interpreted as the sample sizes of the data used for calculating the prior mean and covariance matrix, respectively. In this work, we have $\kappa_0 = \nu_0$.

The posterior of the mean and covariance of data is again a NIW distribution [4, 23]:

(11)  $P(\mu, \Sigma \mid X_{\text{new}}) = \mathrm{NIW}(\mu, \Sigma \mid \mu_n, \kappa_n, \nu_n, \Psi_n),$
(12)  $\mu_n = \dfrac{\kappa_0\, \mu_0 + n_{\text{new}}\, \bar{\mu}_{\text{new}}}{\kappa_0 + n_{\text{new}}}, \qquad \kappa_n = \kappa_0 + n_{\text{new}}, \qquad \nu_n = \nu_0 + n_{\text{new}},$
(13)  $\Psi_n = \Psi_0 + n_{\text{new}}\, \bar{\Sigma}_{\text{new}} + \dfrac{\kappa_0\, n_{\text{new}}}{\kappa_0 + n_{\text{new}}}\, (\bar{\mu}_{\text{new}} - \mu_0)(\bar{\mu}_{\text{new}} - \mu_0)^{\top}.$

The marginal distributions of the mean and covariance of data are [4, 23]:

(14)  $\mu \mid X_{\text{new}} \sim t_{\nu_n - d + 1}\Big(\mu_n,\; \dfrac{\Psi_n}{\kappa_n\,(\nu_n - d + 1)}\Big),$
(15)  $\Sigma \mid X_{\text{new}} \sim \mathcal{W}^{-1}(\Psi_n, \nu_n),$

respectively. Eqs. (14) and (15) can be used to update the parameters of a multivariate normal distribution upon receiving the new data.
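As a concrete illustration of Eqs. (11)-(15), the following NumPy sketch computes the posterior NIW parameters and the resulting point estimates of the mean and covariance. The function and variable names are hypothetical, and the choice of the prior, $\mu_0 = \bar{\mu}_{\text{old}}$ with $\Psi_0$ formed from $\bar{\Sigma}_{\text{old}}$, is one plausible configuration rather than the exact one used in the paper.

```python
import numpy as np

def niw_posterior(mu0, Psi0, kappa0, nu0, X_new):
    """Posterior NIW parameters after observing the new data X_new (Eqs. 11-13)."""
    n_new = X_new.shape[0]
    mu_new = X_new.mean(axis=0)                              # sample mean of new data
    S_new = n_new * np.cov(X_new, rowvar=False, bias=True)   # scatter matrix of new data
    kappa_n = kappa0 + n_new
    nu_n = nu0 + n_new
    mu_n = (kappa0 * mu0 + n_new * mu_new) / kappa_n
    diff = (mu_new - mu0).reshape(-1, 1)
    Psi_n = Psi0 + S_new + (kappa0 * n_new / kappa_n) * (diff @ diff.T)
    return mu_n, Psi_n, kappa_n, nu_n

def point_estimates(mu_n, Psi_n, nu_n, d):
    """Expectations of the marginals (Eqs. 14-15): the mean of the Student-t is mu_n,
    and the mean of the inverse Wishart is Psi_n / (nu_n - d - 1)."""
    return mu_n, Psi_n / (nu_n - d - 1)
```

Calling `niw_posterior` once per class and per mini-batch, followed by `point_estimates`, realizes the dynamic update described in Section III.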

III Dynamic Triplet Sampling for Training Triplet Networks

III-A Preliminaries and Notations

Consider a $\delta$-dimensional training dataset $\{x_i\}_{i=1}^{n}$ where $x_i \in \mathbb{R}^{\delta}$. The class labels of the instances are $\{y_i\}_{i=1}^{n}$. Suppose we have $c$ classes in the dataset. We use mini-batch (of size $b$) stochastic gradient descent for training the network. Let $n_{\text{new}}$ denote the training sample size per class in a mini-batch. We show the $i$-th training instance of the $k$-th class in a mini-batch by $x_i^{(k)}$. Let $f(x_i^{(k)}) \in \mathbb{R}^{p}$ denote the embedding of $x_i^{(k)}$ by the triplet network, where the dimensionality of the embedding space is $p$.

The data for each class are accumulated by receiving new mini-batches of data. Let $n_{\text{old}}^{(k)}$ denote the sample size of the accumulated data for the $k$-th class so far. The sample size for the $k$-th class in a mini-batch is denoted by $n_{\text{new}}^{(k)}$. In this work, we have $n_{\text{new}}^{(1)} = \dots = n_{\text{new}}^{(c)} = n_{\text{new}}$ and $b = c\, n_{\text{new}}$ because we take the same sample size per class in the mini-batch. This is the sample size of the new incoming data per class in every mini-batch. The accumulated data for the $k$-th class so far are denoted by $X_{\text{old}}^{(k)}$. Also, $\mu_k$ and $\Sigma_k$ are the mean and covariance of the distribution of the $k$-th class, respectively.

III-B Sampling Algorithm

We assume a multivariate normal distribution for the embedded data of every class. This assumption makes sense according to the central limit theorem [9] and the fact that the normal distribution is the most common continuous distribution. In the first mini-batch, where no embedding of the training data exists yet, we use Maximum Likelihood Estimation (MLE) to estimate the distribution parameters; the mean and covariance of the embedded data of every class are estimated by the sample mean and sample covariance matrix, respectively.
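A minimal sketch of this MLE initialization, assuming `embeddings` is an $n \times p$ NumPy array holding the embedded instances of one class (hypothetical names):

```python
import numpy as np

def mle_init(embeddings):
    """MLE of the class distribution from the first mini-batch:
    sample mean and (biased) sample covariance of the embeddings."""
    mu = embeddings.mean(axis=0)
    Sigma = np.cov(embeddings, rowvar=False, bias=True)
    return mu, Sigma
```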

In the batches after the first one, we do have some existing data per class, denoted by $X_{\text{old}}^{(k)}$. According to Bayesian updating, the mean and covariance of the distribution of every class are updated by Eqs. (14) and (15), respectively. We update the mean and covariance matrix of the distribution of every class by the expectations of Eqs. (14) and (15), which are the generalized Student-$t$ and the inverse Wishart distributions, respectively. According to the expectations of these two distributions, which were introduced in Section II, the updates of the mean and covariance of the $k$-th class can be given as

(16)  $\mu_k := \mu_n = \dfrac{\kappa_0\, \bar{\mu}_{\text{old}} + n_{\text{new}}\, \bar{\mu}_{\text{new}}}{\kappa_0 + n_{\text{new}}},$
(17)  $\Sigma_k := \dfrac{\Psi_n}{\nu_n - p - 1},$

where the embedding dimensionality $p$ plays the role of $d$ in Eq. (6), and where, in Eqs. (12) and (13), $\bar{\mu}_{\text{new}}$ and $\bar{\Sigma}_{\text{new}}$ are calculated by the sample mean and sample covariance matrix of the new batch of data while the prior parameters are taken from the accumulated statistics of the class. Note that for $\nu_n \leq p + 1$, which occurs in the very first mini-batches of the first epoch, we update the covariance matrix by MLE.

The proposed dynamic triplet sampling is summarized in Algorithm 1. The mean and covariance of every class are estimated by MLE at the initial batch. In the following batches, Bayesian updating is exploited to update the mean and covariance of the classes. After the means and covariances are updated, we sample the triplets. For every instance of a batch, considered as an "anchor", a negative instance is sampled from the distribution of each of the other classes, resulting in $(c-1)$ negatives per anchor. Accordingly, $(c-1)$ positive instances are also sampled from the distribution of the same class as the anchor. Overall, $b\,(c-1)$ triplets are sampled in every mini-batch while the distributions of the classes are being updated dynamically.
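The following NumPy sketch illustrates this sampling step, drawing triplets from the dynamically updated class distributions rather than from the embedded instances themselves. The names are hypothetical (`class_params` maps a class label to its current $(\mu_k, \Sigma_k)$), and this is an illustration of the idea rather than the authors' exact implementation.

```python
import numpy as np

def sample_triplets(anchors, anchor_labels, class_params, rng=np.random.default_rng()):
    """For each anchor embedding, sample one positive from its own class distribution
    and one negative from every other class distribution: b*(c-1) triplets per batch."""
    triplets = []
    classes = list(class_params.keys())
    for a, y in zip(anchors, anchor_labels):
        mu_pos, Sigma_pos = class_params[y]
        for k in classes:
            if k == y:
                continue
            mu_neg, Sigma_neg = class_params[k]
            p = rng.multivariate_normal(mu_pos, Sigma_pos)  # stochastic positive
            n = rng.multivariate_normal(mu_neg, Sigma_neg)  # stochastic negative
            triplets.append((a, p, n))
    return triplets
```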


III-C Optimization of the Loss Functions

In a mini-batch, let the anchor, positive, and negative instances be indexed by $a$, $p$, and $n$, respectively, and let $e_a := f(x_a)$ denote the embedding of the anchor while $e_p$ and $e_n$ denote the sampled positive and negative embeddings. Using the sampled triplets, the triplet-loss function can be employed to train the triplet network [25]:

(18)  $\mathcal{L}_{\text{triplet}} = \displaystyle\sum_{(a, p, n)} \Big[\, \|e_a - e_p\|_2^2 - \|e_a - e_n\|_2^2 + m \,\Big]_{+},$

where $[\cdot]_+ := \max(\cdot, 0)$ denotes the standard hinge loss and $m > 0$ is a small margin. When the dynamic triplet sampling is used with the triplet loss, we call the approach Bayesian Updating for triplet-loss (BUT).
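A compact PyTorch sketch of Eq. (18), assuming `e_a`, `e_p`, and `e_n` are tensors of shape `(num_triplets, p)` holding the anchor embeddings and the sampled positive and negative embeddings (hypothetical variable names; the margin value below is an arbitrary illustrative choice):

```python
import torch

def triplet_loss(e_a, e_p, e_n, margin=0.25):
    """Eq. (18): squared-distance triplet loss with a hinge at the margin."""
    d_ap = ((e_a - e_p) ** 2).sum(dim=1)   # squared anchor-positive distances
    d_an = ((e_a - e_n) ** 2).sum(dim=1)   # squared anchor-negative distances
    return torch.clamp(d_ap - d_an + margin, min=0).mean()
```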

As was mentioned before, the triplet-loss should increase the inter-class and decrease the intra-class variances to obtain a discriminative embedding space for the classes of data. This intuition can also be implemented in a softmax form [32], which is referred to as NCA [6]. We can use this form to train the network:

(19)  $\mathcal{L}_{\text{NCA}} = -\displaystyle\sum_{a} \log \dfrac{\exp\!\big(\!-\|e_a - e_p\|_2^2\big)}{\exp\!\big(\!-\|e_a - e_p\|_2^2\big) + \sum_{n} \exp\!\big(\!-\|e_a - e_n\|_2^2\big)},$

where the sum in the denominator is over the sampled negatives of the anchor. We name the use of dynamic triplet sampling with the NCA loss function Bayesian Updating for NCA loss (BUNCA).
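A corresponding PyTorch sketch of the softmax (NCA) form in Eq. (19), assuming each anchor embedding `e_a[i]` is paired with one sampled positive `e_p[i]` and a set of sampled negatives `e_n[i]` of shape `(c-1, p)` (hypothetical names and shapes):

```python
import torch

def nca_loss(e_a, e_p, e_n):
    """Eq. (19): softmax (NCA) form of the triplet objective."""
    d_ap = ((e_a - e_p) ** 2).sum(dim=1)               # (num_anchors,)
    d_an = ((e_a.unsqueeze(1) - e_n) ** 2).sum(dim=2)  # (num_anchors, c-1)
    logits = torch.cat([-d_ap.unsqueeze(1), -d_an], dim=1)
    # The positive term sits at index 0 of the logits; cross-entropy with
    # target 0 recovers the negative log softmax ratio of Eq. (19).
    targets = torch.zeros(e_a.shape[0], dtype=torch.long)
    return torch.nn.functional.cross_entropy(logits, targets)
```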

IV Experiments

Fig. 1: 2D visualization of test embeddings: (a) MNIST using BUT, (b) MNIST using BUNCA, (c) CRC using BUT, and (d) CRC using BUNCA.
Fig. 2: Image retrieval in the embedded spaces learned using the BUT and BUNCA approaches. The retrievals are sorted from left to right.

IV-A Datasets

We used two different datasets in our experiments. The first dataset is the MNIST digits data [19] with 60,000 training instances and 10,000 test instances of size 28×28 pixels. The second dataset is the large colorectal cancer (CRC) histopathology dataset [16, 17] with 100,000 stain-normalized image patches of size 224×224 pixels. The large CRC dataset includes nine classes of tissues, namely adipose, background, debris, lymphocytes, mucus, smooth muscle, normal colon mucosa (normal), cancer-associated stroma, and colorectal adenocarcinoma epithelium (tumor). Note that the literature has shown the effectiveness of variants of triplet networks for histopathology data, both with the triplet-loss [26] and with the NCA loss [27]; this shows the importance of validating our approaches on this domain.

IV-B Experimental Setup

For the MNIST dataset, we split the training data into training and validation portions, and the standard test set with 10,000 images was used for testing. The CRC data were split into training, validation, and test sets. We used the ResNet-18 network [10] as the backbone of the triplet network. Using the validation set, early stopping [1] was employed, with a fixed maximum number of epochs. Every mini-batch contains five instances per class (i.e., $n_{\text{new}} = 5$), giving batch sizes of 50 and 45 for the MNIST and CRC data, respectively. The learning rate and the dimensionality of the embedding space were kept fixed across all experiments.
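As an illustration of this setup, a ResNet-18 backbone with a linear embedding head could be constructed as follows. This is only a sketch; the embedding dimensionality of 128 used below is a placeholder, not necessarily the value used in the paper.

```python
import torch.nn as nn
from torchvision import models

def build_embedding_network(embedding_dim=128):
    """ResNet-18 backbone whose final layer maps to the embedding space."""
    backbone = models.resnet18(weights=None)  # no pretrained weights assumed
    backbone.fc = nn.Linear(backbone.fc.in_features, embedding_dim)
    return backbone
```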

IV-C Visualization of Embedding Spaces

The 2D visualization of the embedding spaces was performed by applying the Uniform Manifold Approximation and Projection (UMAP) [21] to the embedded data. Figure 1 illustrates the embedding of the test sets of the MNIST and CRC data using the BUT and BUNCA sampling methods. As is apparent in this figure, the learned embedding spaces are interpretable. In the embeddings of the MNIST data, digits with similar writing styles fall close to one another. Digits embedded closely by BUT (see Fig. 1-a) are 1 and 7, 7 and 9, 3 and 8, and 4 (second style of writing) and 9. Likewise, digits embedded closely by BUNCA (see Fig. 1-b) are 0 and 6, 1 and 7, 7 and 9, 3 and 8, and 2 and 3 (because continuing the bottom curve of a 2 results in a 3).

The embedding spaces for the histopathology data are also meaningful. The histopathology patches with similar patterns have been embedded close to each other, as expected. In the embedding obtained with the BUT approach (see Fig. 1-c), the patches are embedded from the smoothest to the roughest patterns in a circular manner. These patches, from smoothest to roughest [18], are adipose (with thin stripes of fat), mucus, smooth muscle, debris, stroma, tumor, normal, and lymphocytes (with a rough pattern). Moreover, the background patches, with no pattern, are separated from the tissues, as expected. In the embedding obtained with the BUNCA approach (see Fig. 1-d), the patches with a comparable amount of roughness are embedded closely. For example, adipose, mucus, stroma, and smooth muscle, which are smoother, fall close to each other, while tumor, normal, lymphocytes, and debris, with rougher and more diverse patterns, are embedded close to each other. Again, the background patches are embedded far from the tissue types. The meaningfulness of the learned embedding spaces shows the effectiveness of the proposed BUT and BUNCA approaches.
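A minimal sketch of this visualization step, assuming `embeddings` and `labels` are NumPy arrays of test embeddings and their class labels (the UMAP hyperparameters are left at the library defaults, which is an assumption):

```python
import matplotlib.pyplot as plt
import umap  # umap-learn package

def plot_embedding_2d(embeddings, labels):
    """Project the learned embeddings to 2D with UMAP and scatter-plot by class."""
    coords = umap.UMAP(n_components=2).fit_transform(embeddings)
    plt.scatter(coords[:, 0], coords[:, 1], c=labels, s=2, cmap="tab10")
    plt.colorbar(label="class")
    plt.show()
```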

IV-D Query Retrieval

For the evaluation of the embedding space, one can view the embedded instances as a database from which nearby cases are retrieved as matches for a query instance. The retrievals are extracted using the nearest neighbors in the embedding space. Because of representation learning, the retrievals are expected to be similar to the query in terms of pattern. In Fig. 2, we illustrate the top ten retrievals for query examples from both the MNIST and histopathology data. The retrievals in the embedding spaces learned using both the BUT and BUNCA approaches are shown to visually verify the similarity matching.
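Such retrieval can be sketched as a standard nearest-neighbor search over the embedded database. The names are hypothetical, and scikit-learn's `NearestNeighbors` is used here as one convenient choice, not necessarily what was used for the experiments.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def retrieve(database_embeddings, query_embedding, k=10):
    """Return the indices of the k nearest database embeddings to the query."""
    index = NearestNeighbors(n_neighbors=k).fit(database_embeddings)
    _, indices = index.kneighbors(query_embedding.reshape(1, -1))
    return indices[0]
```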

IV-D1 Retrieval of Digit Images

In Fig. 2, the retrievals for a digit 4 with the second style of writing are depicted. As expected, the retrievals are very similar to the pattern of the query image, and the first retrievals are more similar to the query than the last ones. For this query example, one of the retrievals of the BUNCA approach is wrong, but it is interpretable: the second writing style of the digit "4" is very similar to the digit "9" and can be morphed into it by a slight change.

IV-D2 Retrieval of Histopathology Patches

Query retrieval can be very useful for histopathology data in hospitals, where similar patches are retrieved from a database of already diagnosed cases. The type of disease or tissue can then be determined by a majority vote amongst the retrievals [15], as sketched below. Fig. 2 shows retrievals for two different tissue types, namely tumor and mucus; the former has more complex patterns than the latter. As the figure shows, the retrievals are very similar to the pattern of the query patch.
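A minimal illustration of this majority-vote diagnosis over the retrieved patches (hypothetical names; it reuses the `retrieve` sketch above):

```python
from collections import Counter

def vote_label(database_labels, retrieved_indices):
    """Assign the query the most common class label among its retrievals."""
    votes = [database_labels[i] for i in retrieved_indices]
    return Counter(votes).most_common(1)[0][0]
```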

Method           R@1     R@4     R@8     R@16
BA [3] 79.31 93.53 96.55 98.21
BSH [25] 78.95 92.61 96.09 98.17
BH [11] 85.75 95.31 97.43 98.63
EP [31] 73.34 90.09 95.08 97.68
DWS [30] 76.44 91.35 95.72 97.68
NCA [6] 85.40 95.48 97.46 98.76
proxy-NCA [22] 83.71 94.69 97.31 98.55
BUT 88.03 96.25 98.15 99.09
BUNCA 78.67 92.44 95.77 98.02
TABLE I: Comparison of the proposed triplet mining approaches with the baselines on the MNIST dataset.

IV-E Comparison with Baseline Methods

In Tables I and II, we compare the proposed BUT and BUNCA approaches with the existing triplet mining methods in the literature. These tables report the Recall@$K$ (R@$K$) metric on the embedded test data for different values of $K$. The baseline approaches we compare with are BA [3], BSH [25], BH [11], EP [31], DWS [30], NCA [6], and proxy-NCA [22]; these methods were briefly introduced in Section I. Among these methods, DWS is a sampling method that samples from the existing instances in the mini-batch, in contrast to our proposed approach, which samples from the distributions of data.
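For reference, the Recall@$K$ metric in these tables could be computed as follows (a sketch with hypothetical names, under the common convention that a query counts as a hit if at least one of its $K$ nearest neighbors in the embedded test set shares its class label):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def recall_at_k(embeddings, labels, k):
    """Fraction of instances with at least one same-class neighbor among their k nearest."""
    labels = np.asarray(labels)
    # k + 1 neighbors because each instance is its own nearest neighbor.
    index = NearestNeighbors(n_neighbors=k + 1).fit(embeddings)
    _, idx = index.kneighbors(embeddings)
    hits = [(labels[neighbors[1:]] == labels[i]).any() for i, neighbors in enumerate(idx)]
    return float(np.mean(hits))
```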

Table I reports the results for the MNIST dataset. The proposed BUT approach outperforms all other methods. Moreover, BUNCA performs better than EP and DWS, where DWS is also a sampling approach for triplet mining. The results for the CRC histopathology data are reported in Table II. On this data, the performance of BUNCA is closer to BUT. In most cases, BUT has the best performance against all the baseline approaches. On this dataset, BUNCA performs better than BA, BSH, EP, DWS, NCA, and is comparable with proxy-NCA. Overall, these two tables demonstrate the effectiveness of the proposed mining approaches for triplet training.

Method           R@1     R@4     R@8     R@16
BA [3] 38.54 66.76 80.64 89.97
BSH [25] 30.85 60.39 77.73 90.33
BH [11] 79.09 92.60 96.00 97.95
EP [31] 69.94 87.88 93.20 96.38
DWS [30] 76.06 91.31 95.34 97.58
NCA [6] 77.87 92.25 95.92 98.01
proxy-NCA [22] 78.85 92.24 95.80 97.78
BUT 79.14 92.32 95.60 97.65
BUNCA 78.67 92.28 95.64 97.71
TABLE II: Comparison of the proposed triplet mining approaches with the baselines on the CRC dataset.

V Conclusions and Future Direction

Different triplet mining approaches have been proposed since the introduction of triplet networks. In this paper, we proposed a triplet mining method that considers a multivariate normal distribution for the embedding of every class and samples the triplets from these distributions rather than from the existing instances in the mini-batch. Using Bayesian updating, the distributions are dynamically updated with the received stream of mini-batches. This approach makes use of the stochastic information of the embedding space, rather than being restricted to the existing instances, for better discrimination of the classes. The proposed BUT and BUNCA approaches for dynamic triplet sampling were validated by experiments on two public datasets and compared against baseline methods from the literature. As possible future work, one can explore a mixture of Gaussian distributions for every class of data using expectation maximization.

References

  • [1] R. Caruana, S. Lawrence, and C. L. Giles (2001) Overfitting in neural nets: backpropagation, conjugate gradient, and early stopping. In Advances in neural information processing systems, pp. 402–408. Cited by: §IV-B.
  • [2] F. Chang and R. Nevatia (2016) Image set classification via template triplets and context-aware similarity embedding. In Asian Conference on Computer Vision, pp. 231–247. Cited by: §I.
  • [3] S. Ding, L. Lin, G. Wang, and H. Chao (2015) Deep feature learning with relative distance comparison for person re-identification. Pattern Recognition 48 (10), pp. 2993–3003. Cited by: §I, §IV-E, TABLE I, TABLE II.
  • [4] A. Gelman, J. B. Carlin, H. S. Stern, D. B. Dunson, A. Vehtari, and D. B. Rubin (2013) Bayesian data analysis. CRC press. Cited by: §I, §II-B, §II-C, §II-C, §II-C, §II-C, §II-C.
  • [5] B. Ghojogh, M. Sikaroudi, S. Shafiei, H. Tizhoosh, F. Karray, and M. Crowley (2020) Fisher discriminant triplet and contrastive losses for training Siamese networks. In 2020 International Joint Conference on Neural Networks (IJCNN). Cited by: §I.
  • [6] J. Goldberger, G. E. Hinton, S. T. Roweis, and R. R. Salakhutdinov (2005) Neighbourhood components analysis. In Advances in neural information processing systems, pp. 513–520. Cited by: §I, §III-C, §IV-E, TABLE I, TABLE II.
  • [7] A. K. Gupta and D. K. Nagar (2018) Matrix variate distributions. Vol. 104, CRC Press. Cited by: §II-C.
  • [8] R. Hadsell, S. Chopra, and Y. LeCun (2006) Dimensionality reduction by learning an invariant mapping. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Vol. 2, pp. 1735–1742. Cited by: §I.
  • [9] M. Hazewinkel (2001) Central limit theorem. Encyclopedia of Mathematics, Springer. Cited by: §III-B.
  • [10] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §IV-B.
  • [11] A. Hermans, L. Beyer, and B. Leibe (2017) In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737. Cited by: §I, §IV-E, TABLE I, TABLE II.
  • [12] E. Hoffer and N. Ailon (2015) Deep metric learning using triplet network. In International Workshop on Similarity-Based Pattern Recognition, pp. 84–92. Cited by: §I.
  • [13] J. Jaffray (1992) Bayesian updating and belief functions. IEEE transactions on systems, man, and cybernetics 22 (5), pp. 1144–1152. Cited by: §I, §II-A.
  • [14] M. I. Jordan (2010) The conjugate prior for the normal distribution. Technical report University of California, Berkeley. Cited by: §I, §II-B.
  • [15] S. Kalra, H. Tizhoosh, S. Shah, C. Choi, S. Damaskinos, A. Safarpoor, S. Shafiei, M. Babaie, P. Diamandis, C. J. Campbell, and L. Pantanowitz (2020) Pan-cancer diagnostic consensus through searching archival histopathology images using artificial intelligence. NPJ Digital Medicine 3 (1), pp. 1–15. Cited by: §IV-D2.
  • [16] Cited by: §IV-A.
  • [17] J. N. Kather, J. Krisam, P. Charoentong, T. Luedde, E. Herpel, C. Weis, T. Gaiser, A. Marx, N. A. Valous, D. Ferber, et al. (2019) Predicting survival from colorectal cancer histology slides using deep learning: a retrospective multicenter study. PLoS Medicine 16 (1). Cited by: §IV-A.
  • [18] J. N. Kather, C. Weis, F. Bianconi, S. M. Melchers, L. R. Schad, T. Gaiser, A. Marx, and F. G. Zöllner (2016) Multi-class texture analysis in colorectal cancer histology. Scientific reports 6, pp. 27988. Cited by: §IV-C.
  • [19] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. Cited by: §IV-A.
  • [20] K. Mardia, J. Kent, and J. Bibby (1979) Multivariate analysis. Academic Press, London. Cited by: §II-C.
  • [21] L. McInnes, J. Healy, and J. Melville (2018) Umap: uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426. Cited by: §IV-C.
  • [22] Y. Movshovitz-Attias, A. Toshev, T. K. Leung, S. Ioffe, and S. Singh (2017) No fuss distance metric learning using proxies. In Proceedings of the IEEE International Conference on Computer Vision, pp. 360–368. Cited by: §I, §IV-E, TABLE I, TABLE II.
  • [23] K. P. Murphy (2007) Conjugate Bayesian analysis of the Gaussian distribution. Technical report University of British Colombia. Cited by: §I, §II-B, §II-C, §II-C, §II-C.
  • [24] I. Papastathopoulos and J. A. Tawn (2013) A generalised Student’s t-distribution. Statistics & Probability Letters 83 (1), pp. 70–77. Cited by: §II-C.
  • [25] F. Schroff, D. Kalenichenko, and J. Philbin (2015) FaceNet: a unified embedding for face recognition and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 815–823. Cited by: §I, §III-C, §IV-E, TABLE I, TABLE II.
  • [26] M. Sikaroudi, A. Safarpoor, B. Ghojogh, S. Shafiei, M. Crowley, and H. Tizhoosh (2020) Supervision and source domain impact on representation learning: a histopathology case study. In 2020 International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), Cited by: §IV-A.
  • [27] E. W. Teh and G. W. Taylor (2020) Learning with less data via weakly labeled patch classification in digital pathology. In 2020 IEEE 17th International Symposium on Biomedical Imaging (ISBI), pp. 471–475. Cited by: §IV-A.
  • [28] P. Theodossiou (1998) Financial data and the skewed generalized t distribution. Management Science 44 (12-part-1), pp. 1650–1661. Cited by: §II-C.
  • [29] D. von Rosen (1988) Moments for the inverted Wishart distribution. Scandinavian Journal of Statistics, pp. 97–109. Cited by: §II-C.
  • [30] C. Wu, R. Manmatha, A. J. Smola, and P. Krahenbuhl (2017) Sampling matters in deep embedding learning. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2840–2848. Cited by: §I, §I, §I, §IV-E, TABLE I, TABLE II.
  • [31] H. Xuan, A. Stylianou, and R. Pless (2020) Improved embeddings with easy positive triplet mining. In The IEEE Winter Conference on Applications of Computer Vision, pp. 2474–2482. Cited by: §I, §IV-E, TABLE I, TABLE II.
  • [32] M. Ye, X. Zhang, P. C. Yuen, and S. Chang (2019) Unsupervised embedding learning via invariant and spreading instance feature. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6210–6219. Cited by: §I, §III-C.