With the advent of deep learning methods, image analysis algorithms have matched and, in some cases, surpassed human expert performance. However, due to their lack of interpretability, the decisions of deep models are not sufficiently transparent. Additionally, these models require massive amounts of labeled data, which can be expensive and time-consuming to obtain for medical data [tizhoosh2018artificial]. To address the interpretability issue, one may evaluate a decision by enabling consensus for an example through retrieving similar cases. To overcome the expensive label requirement, an unsupervised framework, such as the triplet loss, can be applied for training models. In the triplet loss, triplets of anchor-positive-negative instances are considered, where the anchor and positive instances belong to the same class or are similar, while the negative instance belongs to another class or is dissimilar to them. The aim of the triplet loss is to decrease the intra-class variance and increase the inter-class variance of the embeddings by pulling the anchor and positive closer together and pushing the negative away [ghojogh2020fisher].
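As a concrete illustration, the hinge-based triplet loss described above can be sketched in a few lines of NumPy. This is a minimal, illustrative implementation; the function name and the use of squared Euclidean distances are our assumptions, not a prescription from any specific formulation.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard hinge-based triplet loss with squared Euclidean distances.

    Pulls the anchor-positive pair together and pushes the anchor-negative
    pair apart by at least `margin`.
    """
    d_ap = np.sum((anchor - positive) ** 2)  # anchor-positive distance
    d_an = np.sum((anchor - negative) ** 2)  # anchor-negative distance
    return max(d_ap - d_an + margin, 0.0)
```

The loss vanishes once the negative is farther from the anchor than the positive by at least the margin.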
Since the introduction of the triplet loss, many updated versions have been proposed to increase efficiency and improve generalization. Furthermore, considering beneficial aspects of these algorithms, such as unsupervised feature learning, data efficiency, and better generalization, triplet techniques have been applied to many other tasks, such as representation learning in pathology images [teh2019metric, koch2015siamese, medela2019few, sikaroudi2020supervision] and other medical applications [wang2017multi]. Schroff et al. [schroff2015facenet] proposed a method to encode images into a space whose distances reflect the dissimilarity between instances. They trained a deep neural network using triplets of similar and dissimilar cases.
Later, a new family of algorithms emerged to address the shortcomings of the triplet loss by modifying the triplets to better convey the notion of similarity to the network during training. These efforts, such as Batch All (BA) [ding2015deep], Batch Semi-Hard (BSH) [schroff2015facenet], Batch Hard (BH) [hermans2017defense, peng2019multi], Neighborhood Components Analysis (NCA) [goldberger2005neighbourhood], Proxy-NCA (PNCA) [movshovitz2017no, teh2020learning], Easy Positive (EP) [xuan2020improved], and Distance Weighted Sampling (DWS) [wu2017sampling], fall into the online triplet mining category, where triplets are created and altered during training within each batch. As online methods rely on mini-batches of data, they may not reflect the data neighborhood correctly; thus, they can result in a sub-optimal solution. In offline triplet mining, a triplet dataset is created before the training session, while all training samples are taken into account. In this study, we investigate offline and online approaches based on four extreme cases imposed on the positive and negative samples for triplet generation. Our contributions are two-fold. First, we propose four new online methods, in addition to the existing approaches, and five offline methods, all based on extreme distances. Second, we compare the different triplet mining methods on histopathology data and analyze them based on the data patterns.
The remainder of this paper is organized as follows. Section II introduces the proposed offline triplet mining methods. In Section III, we review the online triplet mining methods and propose the new online methods based on extreme distances. The experiments and comparisons are reported in Section IV. Finally, Section V concludes the paper and reports the possible future work.
Notations: Consider a training dataset $\{x_i^c\}$ where $x_i^c$ denotes the $i$-th instance in the $c$-th class. Let $b$ denote the mini-batch size and $k$ the number of classes, and let $d(\cdot,\cdot)$ be a distance metric function, e.g., the squared $\ell_2$ norm. The sample triplet size per class in a batch is $b/k$. We denote the anchor, positive, and negative instances in the $c$-th class by $x_a^c$, $x_p^c$, and $x_n^c$, respectively, and their deep embeddings by $f(x_a^c)$, $f(x_p^c)$, and $f(x_n^c)$, respectively.
II Offline Triplet Mining
In the offline triplet mining approach, the processing of data is not performed during the training of the triplet network but beforehand. The extreme distances are calculated only once on the whole training dataset and not repeatedly in the mini-batches during the training. The histopathology patterns in the input space cannot be distinguished, especially for the visually similar tissues [jimenez2017analysis]. Hence, we work on the extreme distances in the feature space trained using the class labels. The block diagram of the proposed offline triplet mining is depicted in Fig. 1. In the following, we explain the steps of mining in detail.
Training Supervised Feature Space:
We first train a feature space in a supervised manner. For example, a deep network with a cross-entropy loss function can be used for training this space, where the embedding of the one-to-last layer is extracted. We want the feature space to use the labels for better discrimination of classes by increasing their inter-class distances. Hence, we use one subset of the training data for training the supervised network.
Distance Matrix in the Feature Space: After training the supervised network, we embed another, disjoint subset of the training data in the feature space. We compute a distance matrix on the embedded data in the feature space. Using this distance matrix, we can find cases with extreme distances. We consider every embedded instance as an anchor in a triplet, where its nearest or farthest neighbors from the same and other classes are considered as its positive and negative instances, respectively. Hence, we have four different cases with extreme distances, i.e., Easiest Positive and Easiest Negative (EPEN), Easiest Positive and Hardest Negative (EPHN), Hardest Positive and Easiest Negative (HPEN), and Hardest Positive and Hardest Negative (HPHN), in addition to the assorted case, where one of the extreme cases is randomly selected for each triplet.
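This selection step can be sketched as follows, assuming a precomputed distance matrix `D` and integer class labels; the helper name and the mode-string encoding are hypothetical conveniences, not part of the method's formal description.

```python
import numpy as np

def mine_extreme_triplet(D, labels, anchor, mode="HPHN"):
    """Pick one (positive, negative) pair for `anchor` from a precomputed
    distance matrix D, according to one of the four extreme cases:
      mode[0] = 'E'/'H': easiest (nearest) / hardest (farthest) positive,
      mode[2] = 'E'/'H': easiest (farthest) / hardest (nearest) negative.
    """
    same = labels == labels[anchor]
    same[anchor] = False                      # the anchor is not its own positive
    pos_idx = np.flatnonzero(same)
    neg_idx = np.flatnonzero(labels != labels[anchor])
    pos = pos_idx[np.argmin(D[anchor, pos_idx])] if mode[0] == "E" \
        else pos_idx[np.argmax(D[anchor, pos_idx])]
    neg = neg_idx[np.argmin(D[anchor, neg_idx])] if mode[2] == "H" \
        else neg_idx[np.argmax(D[anchor, neg_idx])]
    return pos, neg
```

For instance, with `mode="HPHN"` the anchor is paired with its farthest same-class and nearest other-class instance.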
There might exist some outliers in the data whose embeddings fall far apart from the others. In that case, merely one single outlier may become the hardest negative for all anchors. We prevent this issue with a statistical test [aggarwal2017outlier]: for every data instance embedded in the feature space, the distances from the other instances are standardized using z-score normalization. We consider instances whose distances lie above a high percentile (i.e., whose normalized distances exceed the corresponding threshold) as outliers and ignore them.
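This filtering can be sketched as follows; the 95th-percentile cut-off is an illustrative choice on our part, since the exact threshold is a tunable hyper-parameter.

```python
import numpy as np

def filter_outlier_negatives(dists, percentile=95.0):
    """Flag instances whose z-normalized distance from the anchor exceeds
    the given percentile; these are ignored as candidate negatives.
    Returns a boolean mask of instances to keep."""
    z = (dists - dists.mean()) / dists.std()   # z-score normalization
    threshold = np.percentile(z, percentile)
    return z <= threshold
```

A single extreme distance (e.g., from an outlier embedding) is masked out while typical distances are kept.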
Training the Triplet Network: After preparing the triplets for any of the extreme cases, a triplet network [schroff2015facenet] is trained using these triplets to learn an embedding space that better discriminates dissimilar instances while keeping similar instances close. We call the spaces learned by the supervised and triplet networks the feature space and the embedding space, respectively (see Fig. 1).
III Online Triplet Mining
In the online triplet mining approach, the processing of data is performed during the training phase and in the mini-batch of data. In other words, the triplets are found in the mini-batch of data and not fed as a triplet-form input to the network. There exist several online mining methods in the literature which are introduced in the following. We also propose several new online mining methods based on extreme distances of data instances in the mini-batch.
Batch All [ding2015deep]: One of the online methods, which considers all anchor-positive and anchor-negative pairs in the mini-batch. Its loss function is in the regular triplet loss format, summed over all the triplets in the mini-batch, formulated as
$\mathcal{L}_{\text{BA}} = \sum_{a} \sum_{p \neq a} \sum_{n} \big[ d(f(x_a), f(x_p)) - d(f(x_a), f(x_n)) + \alpha \big]_+$,
where $\alpha$ is the margin between positives and negatives and $[\cdot]_+ := \max(\cdot, 0)$ is the standard Hinge loss.
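A brute-force sketch of the Batch All computation over one mini-batch follows; the triple loop trades efficiency for clarity, and averaging over the triplets (rather than summing) is our assumption.

```python
import numpy as np

def batch_all_loss(embeddings, labels, margin=1.0):
    """Batch All: hinge loss averaged over every valid
    (anchor, positive, negative) triplet in the mini-batch."""
    n = len(labels)
    # pairwise squared Euclidean distances
    diff = embeddings[:, None, :] - embeddings[None, :, :]
    D = np.sum(diff ** 2, axis=-1)
    losses = []
    for a in range(n):
        for p in range(n):
            if p == a or labels[p] != labels[a]:
                continue                      # positives share the anchor's class
            for g in range(n):
                if labels[g] == labels[a]:
                    continue                  # negatives come from other classes
                losses.append(max(D[a, p] - D[a, g] + margin, 0.0))
    return float(np.mean(losses))
```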
Batch Semi-Hard [schroff2015facenet]: The hardest (nearest) negative instance in the mini-batch that is still farther from the anchor than the positive is selected. Its loss function is
$\mathcal{L}_{\text{BSH}} = \sum_{a} \big[ d(f(x_a), f(x_p)) - \min_{n:\, d(f(x_a), f(x_n)) > d(f(x_a), f(x_p))} d(f(x_a), f(x_n)) + \alpha \big]_+$.
Batch Hard [hermans2017defense]: The Hardest Positive and Hardest Negative (HPHN), i.e., the farthest positive and the nearest negative in the mini-batch, are selected. Hence, its loss function is
$\mathcal{L}_{\text{BH}} = \sum_{a} \big[ \max_{p} d(f(x_a), f(x_p)) - \min_{n} d(f(x_a), f(x_n)) + \alpha \big]_+$.
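A corresponding sketch for Batch Hard, again with squared Euclidean distances; averaging over the anchors in the batch is our choice.

```python
import numpy as np

def batch_hard_loss(embeddings, labels, margin=1.0):
    """Batch Hard: for every anchor, use only the farthest (hardest)
    positive and the nearest (hardest) negative in the mini-batch."""
    n = len(labels)
    diff = embeddings[:, None, :] - embeddings[None, :, :]
    D = np.sum(diff ** 2, axis=-1)          # pairwise squared L2 distances
    total = 0.0
    for a in range(n):
        pos = (labels == labels[a]) & (np.arange(n) != a)
        neg = labels != labels[a]
        hardest_pos = D[a, pos].max()       # farthest same-class instance
        hardest_neg = D[a, neg].min()       # nearest other-class instance
        total += max(hardest_pos - hardest_neg + margin, 0.0)
    return total / n
```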
NCA [goldberger2005neighbourhood]: The softmax form [ye2019unsupervised] instead of the regular triplet loss [schroff2015facenet] is used. It considers all possible negatives in the mini-batch for an anchor by
$\mathcal{L}_{\text{NCA}} = -\sum_{a} \ln \frac{\exp(-d(f(x_a), f(x_p)))}{\exp(-d(f(x_a), f(x_p))) + \sum_{n} \exp(-d(f(x_a), f(x_n)))}$,
where $\ln(\cdot)$ is the natural logarithm and $\exp(\cdot)$ is the exponential operator.
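The NCA-style softmax loss for one anchor can be sketched as below; the function name and the single-anchor formulation are our simplifications.

```python
import numpy as np

def nca_loss(embeddings, labels, anchor, positive):
    """NCA softmax loss for one anchor: every other-class instance in the
    mini-batch serves as a negative; distances are squared Euclidean."""
    d = np.sum((embeddings - embeddings[anchor]) ** 2, axis=1)
    neg = labels != labels[anchor]
    num = np.exp(-d[positive])
    return -np.log(num / (num + np.sum(np.exp(-d[neg]))))
```

When the positive sits on top of the anchor and all negatives are far away, the softmax ratio approaches one and the loss approaches zero.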
Proxy-NCA [movshovitz2017no]: A set of proxies $\Pi$, e.g., the centers of classes, with cardinality equal to the number of classes, is used. For memory efficiency, an embedding $f(x)$ is assigned to its nearest proxy, $\pi(x) := \arg\min_{\pi \in \Pi} d(f(x), \pi)$. PNCA uses the proxies of the positive and negative instances in the NCA loss:
$\mathcal{L}_{\text{PNCA}} = -\sum_{a} \ln \frac{\exp(-d(f(x_a), \pi(x_p)))}{\sum_{n} \exp(-d(f(x_a), \pi(x_n)))}$.
Easy Positive [xuan2020improved]: Let $x_{p^\star}$ be the easiest (nearest) positive for the anchor. If the embeddings are normalized and fall on a unit hyper-sphere, the loss in the EP method is
$\mathcal{L}_{\text{EP}} = -\sum_{a} \ln \frac{\exp(f(x_a)^\top f(x_{p^\star}))}{\exp(f(x_a)^\top f(x_{p^\star})) + \sum_{n} \exp(f(x_a)^\top f(x_n))}$. (6)
Our experiments showed that, for the colorectal cancer (CRC) histopathology dataset [kather2019predicting], the performance improves if the inner products in Eq. (6) are replaced with negative distances. We call the EP method with distances EP-D, whose loss function is
$\mathcal{L}_{\text{EP-D}} = -\sum_{a} \ln \frac{\exp(-d(f(x_a), f(x_{p^\star})))}{\exp(-d(f(x_a), f(x_{p^\star}))) + \sum_{n} \exp(-d(f(x_a), f(x_n)))}$.
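Both variants can be sketched in one helper, where `use_distance=True` switches from inner products (EP, assuming L2-normalized embeddings) to negative squared distances (EP-D); the helper name and single-anchor form are hypothetical.

```python
import numpy as np

def ep_loss(embeddings, labels, anchor, use_distance=False):
    """Easy Positive loss for one anchor: the most similar same-class
    instance is taken as the positive. use_distance=True gives the EP-D
    variant, replacing inner products with negative squared distances."""
    n = len(labels)
    same = (labels == labels[anchor]) & (np.arange(n) != anchor)
    neg = labels != labels[anchor]
    if use_distance:
        sim = -np.sum((embeddings - embeddings[anchor]) ** 2, axis=1)
    else:
        sim = embeddings @ embeddings[anchor]   # inner-product similarity
    p = np.flatnonzero(same)[np.argmax(sim[same])]   # easiest positive
    return -np.log(np.exp(sim[p]) / (np.exp(sim[p]) + np.sum(np.exp(sim[neg]))))
```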
Distance Weighted Sampling [wu2017sampling]: For embeddings on a unit hyper-sphere of dimensionality $m$, the distribution of pairwise distances $\delta$ is proportional to $q(\delta) \propto \delta^{m-2} (1 - \delta^2/4)^{(m-3)/2}$ [wu2017sampling]. For a triplet, the negative sample is drawn with probability inversely proportional to this density, $n^\star \sim \mathbb{P}(n \mid a) \propto \min(\lambda, q^{-1}(d(f(x_a), f(x_n))))$, for a clipping constant $\lambda$. The loss function in the DWS method is
$\mathcal{L}_{\text{DWS}} = \sum_{a} \big[ d(f(x_a), f(x_p)) - d(f(x_a), f(x_{n^\star})) + \alpha \big]_+$.
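A sketch of the negative-sampling step is given below; the clipping value and the use of log-space weights are our implementation choices, and `q` here is the pairwise-distance density on a unit hyper-sphere as quoted in [wu2017sampling].

```python
import numpy as np

def sample_dws_negative(dists, dim, clip=1.4, rng=None):
    """Draw a negative index with probability inversely proportional to the
    density q(d) ~ d^(dim-2) * (1 - d^2/4)^((dim-3)/2) of pairwise distances
    on a unit hyper-sphere. Small distances are clipped to bound the weights."""
    rng = rng if rng is not None else np.random.default_rng(0)
    d = np.maximum(dists, clip)             # avoid the vanishing-density region
    log_q = ((dim - 2) * np.log(d)
             + ((dim - 3) / 2) * np.log(np.clip(1.0 - d ** 2 / 4.0, 1e-12, None)))
    w = np.exp(-log_q)                      # inverse-density weights
    return rng.choice(len(dists), p=w / w.sum())
```

In high dimensions the density concentrates around moderate distances, so inverse-density sampling strongly favors the rarer, more informative distances.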
Extreme Distances: We propose four additional online methods based on extreme distances. In the mini-batch, we consider every instance once as an anchor and take its nearest/farthest same-class instance as the easiest/hardest positive and its nearest/farthest other-class instance as the hardest/easiest negative. Hence, four different cases exist, i.e., Easiest Positive and Easiest Negative (EPEN), Easiest Positive and Hardest Negative (EPHN), Hardest Positive and Easiest Negative (HPEN), and Hardest Positive and Hardest Negative (HPHN). Considering the extreme values, especially the farthest ones, was inspired by opposition-based learning [tizhoosh2005opposition, tizhoosh2008oppositional]. HPHN is equivalent to BH, as already explained. We can also have a mixture of these four cases (i.e., the assorted case), where for every anchor in the mini-batch, one of the cases is randomly considered. The proposed online mining loss functions are
$\mathcal{L}_{\text{EPEN}} = \sum_{a} \big[ \min_{p} d(f(x_a), f(x_p)) - \max_{n} d(f(x_a), f(x_n)) + \alpha \big]_+$,
$\mathcal{L}_{\text{EPHN}} = \sum_{a} \big[ \min_{p} d(f(x_a), f(x_p)) - \min_{n} d(f(x_a), f(x_n)) + \alpha \big]_+$,
$\mathcal{L}_{\text{HPEN}} = \sum_{a} \big[ \max_{p} d(f(x_a), f(x_p)) - \max_{n} d(f(x_a), f(x_n)) + \alpha \big]_+$,
$\mathcal{L}_{\text{assorted}} = \sum_{a} \big[ \operatorname{m}^{\star}_{p}\, d(f(x_a), f(x_p)) - \operatorname{m}^{\star}_{n}\, d(f(x_a), f(x_n)) + \alpha \big]_+$,
where $\operatorname{m}^{\star}$ denotes random selection between the minimum and maximum operators.
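The four extreme cases plus the assorted variant can be sketched as a mini-batch mining routine; the helper name, the mode-string encoding, and the tuple output are our conventions.

```python
import numpy as np

def online_extreme_triplets(embeddings, labels, mode="EPHN", rng=None):
    """For every anchor in the mini-batch, pick the positive and negative at
    extreme distances. mode is one of EPEN, EPHN, HPEN, HPHN, or "assorted"
    (a random extreme case per anchor)."""
    rng = rng if rng is not None else np.random.default_rng(0)
    n = len(labels)
    diff = embeddings[:, None, :] - embeddings[None, :, :]
    D = np.sum(diff ** 2, axis=-1)          # pairwise squared L2 distances
    triplets = []
    for a in range(n):
        m = mode if mode != "assorted" else rng.choice(["EPEN", "EPHN", "HPEN", "HPHN"])
        pos = np.flatnonzero((labels == labels[a]) & (np.arange(n) != a))
        neg = np.flatnonzero(labels != labels[a])
        p = pos[np.argmin(D[a, pos])] if m[0] == "E" else pos[np.argmax(D[a, pos])]
        g = neg[np.argmin(D[a, neg])] if m[2] == "H" else neg[np.argmax(D[a, neg])]
        triplets.append((a, p, g))
    return triplets
```

With `mode="HPHN"` this reproduces the Batch Hard selection for every anchor.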
IV Experiments and Comparisons
Dataset: We used the large colorectal cancer (CRC) histopathology dataset [kather2019predicting] with 100,000 stain-normalized patches. The large CRC dataset includes nine classes of tissues, namely adipose, background, debris, lymphocytes (lympho), mucus, smooth muscle, normal colon mucosa (normal), cancer-associated stroma, and colorectal adenocarcinoma epithelium (tumor). Some example patches for these tissue types are illustrated in Fig. 2.
Experimental Setup: We split the data into 70K, 15K, and 15K patches, used for training the supervised network, constructing the offline triplets, and testing, respectively. We used ResNet-18 [he2016deep] as the backbone of both the supervised network and the triplet network. For the sake of a fair comparison, the mini-batch size in the offline and online mining approaches was 48 (16 triplets) and 45 (5 samples per each of the 9 classes), respectively, which are roughly equal. The learning rate, the maximum number of epochs, and the margin of the triplet loss were kept the same for all methods, and the feature and embedding spaces had the same dimensionality.
Offline Patches with Extreme Distances: Figure 3 depicts some examples of the triplets created offline with extreme distances in the feature space. The nearest/farthest positives and negatives are visually similar/dissimilar to the anchor patches, as expected. This shows that the learned feature space is a satisfactory subspace for feature extraction, which is reasonably compatible with the visual patterns.
Relation of the Colorectal Tissues: The chord diagrams of the negatives with extreme distances in offline mining are illustrated in Fig. 4. For both the nearest and farthest cases, the background and normal tissues have not been negatives of any anchor. Some stroma and debris patches are the nearest negatives for smooth muscle, as are adipose for background patches, and lympho, mucus, and tumor for normal patches. This stems from the fact that the patterns of these patches are hard to discriminate, especially tumor versus normal, and stroma and debris versus smooth muscle. Among the farthest negatives, lympho, debris, mucus, stroma, and tumor are negatives of smooth muscle, as are debris, smooth muscle, and lympho for the adipose texture, and adipose and smooth muscle for normal patches. This is meaningful since they have clearly different patterns. In the assorted case, different types of negatives are selected as a mixture of the nearest and farthest negative patches. This gives more variety to the triplets so that the network sees different cases during training.
Offline versus Online Embedding: The evaluations of the embedding spaces found by the different offline and online methods are reported in Tables I and II, respectively. The Recall@$k$ (with ranks $k \in \{1, 4, 8, 16\}$) and closest-neighbor accuracy metrics are reported.
In offline mining, HPHN has the weakest performance on both the training and test sets, suggesting that the architecture or the embedding dimensionality may be too small for these strictly hard cases, i.e., the network might be under-parameterized. We performed another experiment using ResNet-50 to see whether a more complex architecture would help [hermans2017defense]. The results showed that, for the same maximum number of epochs, either increasing the embedding dimensionality or utilizing the ResNet-50 architecture increased the accuracy by 4%. The test accuracy in online mining is not as promising as in offline mining because, in online mining, we only see a small portion of each class in a mini-batch. The chance of having the most dissimilar/similar patches in a mini-batch is much lower than when we select triplets in an offline manner. In other words, mining in mini-batches strongly depends on having a representative population of every class in each batch. Besides, the slightly higher training accuracy of online mining compared to offline mining can be a herald of overfitting in online mining. Tables I and II show that the easiest negatives yield comparable results. This is because the histopathology patches (specifically in this dataset) may have small intra-class variance for most of the tissue types (e.g., the lympho tissue) and large intra-class variance for some others (e.g., the normal tissue). Moreover, there is a small inter-class variance between these patches (with similar patterns, e.g., the tumor and normal tissue types are visually similar); hence, using the easy negatives does not drop the performance drastically. Moreover, as seen in Fig. 4, the hardest negatives might not be perfect candidates for negative patches in histopathology data because many patches from different tissue types erroneously include shared textures from the patching stage [kather2019predicting].
In addition, the small inter-class variance explains why the hardest negatives struggle to reach the best performance, as also reported in [hermans2017defense]. Furthermore, the literature has shown that triplets created based on the easiest extreme distances can avoid over-clustering and yield better performance [xuan2020improved], which is also corroborated by our results. The assorted approach also has decent performance because both the inter-class and intra-class variances are considered. Finally, the offline and online approaches can be compared in terms of batch size. Increasing the batch size can make the training of the network intractable [movshovitz2017no]. On the other hand, a larger batch size implies a better statistical population of the data, i.e., a decent representative of every class. An ideal method would have a large batch size without sacrificing tractability. Offline mining can be seen as the limit case of online mining in which the mini-batch contains all training instances, yet it remains tractable because the triplets are created in pre-processing. Hence, offline mining shows promising performance in Table I compared to online mining in Table II.
Retrieval of Histopathology Patches: Finally, in Fig. 5, we report the top retrievals for a sample tumor query [kalra2020pan]. As the figure shows, the EPEN, HPEN, and assorted cases have the fewest false retrievals among the offline methods. In online mining, BSH, DWS, EPEN, and HPEN perform best. These findings coincide with the results in Tables I and II, which show that these methods had better performance. Comparing the offline and online methods in Fig. 5 shows that more online approaches than offline ones have false retrievals, demonstrating that the offline methods benefit from a better statistical population of the data.
V Conclusion and Future Direction
In this paper, we comprehensively analyzed the offline and online approaches for colorectal histopathology data. We investigated twelve online and five offline mining approaches, including the state-of-the-art triplet mining methods and extreme distance cases. We explained the performance of offline and online mining in terms of histopathology data patterns. The offline mining was interpreted as a tractable generalization of the online mining where the statistical population of data is better captured for triplet mining. We also explored the relation of the colorectal tissues in terms of extreme distances.
One possible future direction is to improve upon the existing triplet sampling methods, such as [wu2017sampling], for online mining and apply them to histopathology data. One could consider dynamic updates of the probability density functions of the mini-batches to sample triplets from the embedding space. Such dynamic sampling may improve the embedding of histopathology data by exploring more of the embedding space in a stochastic manner.