Image Embedded Segmentation: Combining Supervised and Unsupervised Objectives through Generative Adversarial Networks

01/30/2020 ∙ by C. T. Sari, et al. ∙ Bilkent University and Hacettepe University

This paper presents a new regularization method to train a fully convolutional network for semantic tissue segmentation in histopathological images. This method relies on benefiting from unsupervised learning, in the form of image reconstruction, during network training. To this end, it defines a new embedding that unites the main supervised task of semantic segmentation and an auxiliary unsupervised task of image reconstruction into a single task, and proposes to learn this united task with a single generative model. This embedding generates a multi-channel output image by superimposing an original input image on its segmentation map. The method then learns to translate the input image to this embedded output image using a conditional generative adversarial network, which is known to be quite effective for image-to-image translation. This proposal differs from the existing approach that uses image reconstruction for the same regularization purpose. The existing approach considers segmentation and image reconstruction as two separate tasks in a multi-task network, defines their losses independently, and then combines these losses in a joint loss function. However, the definition of such a function requires externally determining the contributions of the supervised and unsupervised losses that yield balanced learning between the segmentation and image reconstruction tasks. The proposed approach eliminates this difficulty by uniting the two tasks into a single one, which intrinsically combines their losses. Using histopathological image segmentation as a showcase application, our experiments demonstrate that the proposed approach leads to better segmentation results.


I Introduction

Unsupervised learning has been used as a regularization tool to train a deep neural network for a supervised task. Earlier studies used layer-wise unsupervised pretraining to initialize network weights, which are then finetuned by supervised training with backpropagation. This pretraining may regularize backpropagation by letting it start from a better solution, which in turn may improve the generalization ability of the network [1, 2]. On the other hand, it has been argued that the weights learned by pretraining may easily be overwritten during supervised training [3], or may not even provide a better initial solution at all [4], since the network is pretrained independently, unaware of the supervised task.

Thus, for more effective regularization, more recent studies have proposed to train the network to simultaneously minimize supervised and unsupervised losses by backpropagation [3, 4, 5, 6]. For that, they define the supervised loss on the main task of classification and the unsupervised loss on an auxiliary task of image reconstruction. These supervised and unsupervised tasks typically share an encoder path to extract feature maps, from which a decoder path reconstructs an image and a classification path estimates a one-hot class label. In [6], in addition to this network, another autoencoder with its own encoder and decoder paths is used, and the outputs of the two decoders are combined to reconstruct the image. In these studies, the reconstruction loss is calculated between the original and decoded images as well as between the maps produced by the corresponding intermediate layers of the encoder and decoder paths. In [4], noisy original images are used as inputs and the reconstruction loss is calculated between these images and their denoised versions.

All these previous studies define losses on the classification and reconstruction tasks separately and afterwards combine them linearly in a joint loss function. They then use this function to simultaneously learn the classification and reconstruction tasks by backpropagation. This simultaneous learning may provide regularization since the tasks compete with each other during backpropagation. On the other hand, the effectiveness of this regularization highly depends on how much the supervised and unsupervised losses contribute to the joint loss function; in other words, it depends on their coefficients in the proposed linear combinations. When the coefficient of the unsupervised loss is too large compared to that of the supervised loss, the network may not sufficiently learn the main task of classification. On the contrary, when it is too small, the network may not learn the auxiliary reconstruction task, so the expected regularization effect of unsupervised learning is not obtained. Thus, it is important to select good coefficients that favor the supervised and unsupervised tasks by the "right" amount. However, depending on the application, this selection may not always be straightforward. It may become even harder when the joint loss includes more than one reconstruction loss (e.g., the one at the input level and those at the intermediate layers).

In response to these issues, this paper presents a new regularization method for semantic segmentation of histopathological images. This method relies on defining a new embedding that unites the main task of semantic segmentation (classification) and an auxiliary task of image reconstruction into a single task, and on learning this task with a single generative model. To this end, it first introduces an embedding that generates a multi-channel output image, on which segmentation is trivial, by superimposing an input image on its segmentation map. Then, it proposes to learn this newly generated output image from the input image using a conditional generative adversarial network (cGAN), which is known to be effective for image-to-image translation. This new embedding, together with its learning by a cGAN, provides two main advantages. First, the proposed embedding unites the segmentation and reconstruction tasks into a single one, which combines the supervised and unsupervised objectives (losses) in a very natural way. This approach differs from the previous studies, which define supervised and unsupervised losses separately and combine them in a joint loss function. The definition of such a function necessitates externally determining the right contributions (coefficients) of the losses that yield balanced learning between the segmentation and reconstruction tasks. On the contrary, the proposed approach eliminates this necessity since the definition of its united task intrinsically combines these losses. More importantly, since the output image of this united task corresponds to a segmentation map that preserves a reconstructive ability, uniting the segmentation and reconstruction tasks forces the network to jointly learn image features and context features. This joint learning, in turn, provides effective regularization, leading to better segmentation results in our experiments. Second, the proposed method learns the output image of the united task by benefiting from the well-known synthesizing ability of cGANs. Thanks to using a cGAN, the method is able to produce more realistic output images that adhere to spatial contiguity without any additional postprocessing steps (e.g., conditional random fields [7]). To the best of our knowledge, this is the first proposal of using a cGAN to produce this kind of embedded output image that can be directly used for semantic segmentation. Using histopathological image segmentation as a showcase application, our experiments demonstrate that introducing this new embedding and learning it with a cGAN improves on the results of its counterparts.

Fig. 1: Schematic overview of the training phase in the iMEMS method.

II Related Work

Fully Convolutional Networks (FCNs) provide efficient solutions for semantic segmentation as they employ end-to-end training to generate segmentation maps that label every image pixel [8]. Training an FCN usually necessitates regularization, especially when some pixels are hard to learn and when the annotated data is limited. As a regularization tool to improve the training success, the previous studies have proposed multi-task FCN architectures that consider additional complementary tasks along with the main task of segmentation. For that, they construct a network with a shared encoder and many parallel decoders, each for a different task, and train this network by minimizing the joint loss defined on all decoders. These studies commonly focus on instance segmentation problems, and thus, typically define their additional tasks as predicting the boundary of instances [9].

Another commonly used regularization tool is employing unsupervised learning to train the network. This has been achieved by defining an additional task of image reconstruction and concurrently learning it together with the main task. Most of the previous studies focus on non-dense prediction tasks, defining their main task as predicting a one-hot class label for an entire image [3, 4, 5, 6]. Only a few studies consider the main task of image segmentation [10, 11]. However, as also mentioned in the introduction, all these studies use image reconstruction as an auxiliary task and linearly combine its loss and the loss of classification/segmentation, which are defined independently, in a joint loss function. This differs from our approach, which unites the image reconstruction and segmentation tasks into a single one through its proposed embedding and trains its network to minimize the loss defined on this united task. Additionally, these previous studies do not use a GAN for their network.

Segmentation methods based on FCNs typically use a pixel-wise loss and train their networks to predict the semantic labels of image pixels independently of each other. This may prevent capturing local and global spatial contiguity within an entire image. To recover fine details, conditional random fields (CRFs) using pair-wise potentials have been employed as a post-processing step to refine the segmentation maps generated by FCNs [7, 12]. Although CRFs lead to improvements, the integration of FCNs and CRFs with higher-order potentials is limited [13]. This limitation has led researchers to use generative adversarial networks for this purpose [14].

Generative adversarial networks (GANs) were primarily proposed for image synthesis; they employ two networks, a generator and a discriminator, trained in an adversarial manner [15]. Their first application to semantic segmentation employs a cGAN, which gives an additional input to the generator (segmentor) to control its output [14]. Since then, cGANs have been shown to be useful for various image-to-image translation tasks, including semantic segmentation [16]. It has also been proposed to use an adversarial loss to regularize the training of other networks. In [17], it is used for an autoencoder to better learn the feature maps (the outputs of its encoder). For that, the encoder and decoder are trained to minimize the reconstruction loss between the encoder's input and the decoder's output, as usual. In addition, the encoder is considered as the generator of an adversarial network, and thus its outputs are fed to the discriminator. The encoder weights are then updated also considering this additional adversarial loss. In [18], it is proposed to design a multi-task network that first estimates the segmentation map from an image and then reconstructs the image from the estimated map for regularization. Since this design uses a cGAN for image reconstruction, it employs an adversarial loss in addition to the segmentation and image reconstruction losses. However, similar to the aforementioned studies, this design defines these losses separately and combines them in a linear joint loss function. Likewise, none of these studies exploit any embedding to combine supervised and unsupervised losses for regularizing their network for semantic segmentation.

Histopathological image segmentation has been studied at different levels. At the tissue level, which is also the focus of our study, the aim is to divide an image into histologically meaningful tissue compartments. For that, one group of studies trains a convolutional neural network (CNN) on image patches. They then classify a given image either with the label outputted by this trained CNN [19] or using another classifier trained on the feature maps of its intermediate layers [20]. Since a CNN predicts a single label for the entire image (but not for all of its pixels), a sliding-window approach is usually used to label the pixels. More recently, pixel-level predictions have been inferred using another network trained on the posteriors of the CNN [21] and the feature maps of its intermediate layers [22]. Another group of studies trains an FCN, usually a UNet with long skip connections [23], to label the image pixels [24, 25]. It has also been proposed to train multiple FCNs and fuse their predictions. For that, in [26], FCNs are trained on images of different resolutions. In another study [27], different FCNs are constructed by starting the upsampling operation from different layers of the same encoder.

Other studies perform their segmentations at finer levels; they usually segment nucleus and gland instances in tissue images. These studies typically use multi-task networks, which define auxiliary tasks and learn them along with the main task of instance segmentation. The auxiliary tasks are commonly defined as predicting the boundary of instances [9] and their bounding boxes [28]. It is also possible to use application-specific additional tasks, such as lumen prediction [29] and malignancy classification [30] for gland instance segmentation.

Different from our proposed method, none of these studies define an embedding to unite the segmentation and image reconstruction tasks or use a cGAN to learn this united task. There exist only a few studies that use a cGAN for nucleus and gland instance segmentation [31, 32]. However, these studies define an adversarial loss on the genuineness of their segmentation maps and do not consider an image reconstruction loss in their segmentation networks. Additionally, they do not exploit any embedding to regularize their training. Note that GANs are also used to synthesize additional training data [31, 33].

III Methodology

The proposed regularization method, which we call the iMage EMbedded Segmentation (iMEMS) method, defines a new embedding to transform semantic segmentation to the problem of image-to-image translation and then solves it using a conditional generative adversarial network (cGAN). Its motivation is as follows: The proposed transformation facilitates an easy and effective way of uniting a supervised task of semantic segmentation and an unsupervised task of image reconstruction into a single task. By its definition, learning this united task inherently requires meeting the supervised and unsupervised objectives simultaneously (i.e., it requires minimizing the segmentation and reconstruction losses at the same time). Thus, the network should jointly learn image features to segment an image and context features to reconstruct it. This joint learning stands as an effective means of regularizing the network training.

In the iMEMS method, the training phase starts with generating a multi-channel output image for each training instance by embedding an input image onto its segmentation map. Each channel of this output image corresponds to a segmentation label. Then, the original input images together with their generated outputs are fed to the cGAN for its training. An overview of this training phase is illustrated in Fig. 1. The details of the proposed embedding and the cGAN architecture are presented in Sec. III-A and Sec. III-B, respectively. After this training, the output of an unsegmented image is estimated by the generator of the trained cGAN. Each of its pixels is classified with the segmentation label that corresponds to the output channel with the highest estimated value. This segmentation step is explained in Sec. III-C. The iMEMS method is implemented in Python using the Keras framework.

Fig. 2: (a) An original input image $I$, (b) its ground truth segmentation map $S$, and (c) the first channel $O_1$ of its output image $O$. This output channel is generated for the segmentation label shown in green in $S$. Note that our experiments use a semantic segmentation problem in which one of five labels is predicted for each pixel; this particular image does not contain any pixel belonging to the fifth label. Thus, the generated output image has five channels (i.e., $O_1$, $O_2$, $O_3$, $O_4$, and $O_5$ are generated for the input image). This figure shows only one of these channels.

III-A Proposed Embedding

Let $I$ be an RGB image in the training set, $I_g$ be its grayscale, and $S$ be its ground truth segmentation map, which may contain $K$ possible labels. The embedding generates a $K$-channel output image $O$ by superimposing the grayscale $I_g$ on the segmentation map $S$. For that, for each segmentation label $k$, it generates an output channel $O_k$. For a pixel $p$, this output channel is defined as follows:

$$O_k(p) = \begin{cases} 128 + \dfrac{127}{255}\, I_g(p), & \text{if } S(p) = k \\[4pt] 127 - \dfrac{127}{255}\, I_g(p), & \text{otherwise} \end{cases} \qquad (1)$$

This definition maps the grayscale intensities of all pixels belonging to the $k$-th label to the interval [128, 255] in the $k$-th output channel and to the interval [0, 127] in all other channels. However, in mapping these intensities to [0, 127] in the other channels, it inverts their values. In other words, the grayscale intensity interval [0, 255] is mapped to [128, 255] in the $k$-th output channel if a pixel belongs to the $k$-th segmentation label, and to [127, 0] otherwise. The definition uses this inversion to make the characteristics of pixels in the foreground and background regions of the $k$-th channel more distinguishable.

The proposed definition is illustrated in Fig. 2. This figure depicts one of the output channels generated for an input image with respect to its ground truth segmentation map. As seen in this figure, foreground regions in this channel seem brighter, as they are mapped to the interval of [128, 255], whereas background regions seem darker, as they are mapped to the interval of [0, 127]. Thus, it is quite trivial to segment the foreground regions in this generated output image. Additionally, as also seen in this figure, both foreground and background regions in this output preserve the original image content, which helps regularize a network in learning how to distinguish these two regions.

Note that this definition requires having the ground truth segmentation map for an input image. Thus, the iMEMS method only employs this definition to generate the output images for segmented training instances. These generated output images are used to train a cGAN. Then, for an unsegmented (test) image, the iMEMS method estimates this output using the generator of the trained cGAN.
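As a concrete illustration, the following is a minimal NumPy sketch of this embedding, assuming a grayscale image with intensities in [0, 255] and an integer ground-truth map with labels in {0, ..., K-1}; the scaling constants simply realize the intervals described above and are not claimed to be the authors' exact implementation.

```python
import numpy as np

def embed_image(gray, labels, num_labels):
    """Generate the K-channel embedded output image of Eq. (1).

    gray   : (H, W) grayscale image with intensities in [0, 255]
    labels : (H, W) integer ground-truth map with values in {0, ..., K-1}
    Returns an (H, W, K) array whose k-th channel maps pixels of label k
    to [128, 255] and all other pixels, with inverted intensities, to [127, 0].
    """
    scaled = gray.astype(np.float32) * 127.0 / 255.0       # rescale [0, 255] -> [0, 127]
    output = np.empty(gray.shape + (num_labels,), dtype=np.float32)
    for k in range(num_labels):
        foreground = labels == k
        output[..., k] = np.where(foreground,
                                  128.0 + scaled,           # k-th label: [128, 255]
                                  127.0 - scaled)           # other labels: inverted [127, 0]
    return output
```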

Fig. 3: Architecture of the generator network in the cGAN. Different layers and operations are indicated with different colors. The resolution of the feature maps in each layer together with the number of these feature maps are also indicated.
Fig. 4: Architecture of the discriminator network in the cGAN. Different layers and operations are indicated with different colors. The resolution of the feature maps in each layer together with the number of these feature maps are also indicated.

III-B cGAN Architecture and Training

The iMEMS method estimates an output image from an original RGB input image using a cGAN. In other words, it translates one image to another using a cGAN. The details of the generator and discriminator of this cGAN are given below.

The generator takes a three-channel normalized RGB image as its input and produces a $K$-channel image as its output. For that, it trains a network with a UNet-based architecture [23], which is depicted in Fig. 3. It consists of an encoder and a decoder path that are connected by symmetric skip connections. All convolution layers use the same filter size, and all pooling and upsampling layers use the same window size. The ReLU activation function is used in all convolution layers except the last one. The last layer uses a linear function since this is a regression problem and the generator estimates the continuous intensity values of the output image. Extra dropout layers are added to reduce overfitting; the dropout factor is set to 0.2.
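A condensed Keras sketch of such a UNet-style generator is given below. The depth, filter counts, filter sizes, and input resolution are illustrative placeholders rather than the exact configuration of Fig. 3, but the structural choices described above (symmetric skip connections, ReLU convolutions, dropout of 0.2, and a linear K-channel output) are reflected.

```python
from tensorflow.keras import layers, Model

def build_generator(input_shape=(512, 512, 3), num_labels=5, base_filters=32):
    """Illustrative UNet-style generator: normalized RGB input -> K-channel embedded output."""
    inputs = layers.Input(shape=input_shape)

    # Encoder path
    c1 = layers.Conv2D(base_filters, 3, padding='same', activation='relu')(inputs)
    c1 = layers.Dropout(0.2)(c1)
    p1 = layers.MaxPooling2D(2)(c1)
    c2 = layers.Conv2D(base_filters * 2, 3, padding='same', activation='relu')(p1)
    c2 = layers.Dropout(0.2)(c2)
    p2 = layers.MaxPooling2D(2)(c2)

    # Bottleneck
    b = layers.Conv2D(base_filters * 4, 3, padding='same', activation='relu')(p2)

    # Decoder path with symmetric skip connections
    u2 = layers.Concatenate()([layers.UpSampling2D(2)(b), c2])
    d2 = layers.Conv2D(base_filters * 2, 3, padding='same', activation='relu')(u2)
    u1 = layers.Concatenate()([layers.UpSampling2D(2)(d2), c1])
    d1 = layers.Conv2D(base_filters, 3, padding='same', activation='relu')(u1)

    # Linear last layer: the generator regresses continuous intensities of the embedding
    outputs = layers.Conv2D(num_labels, 1, activation='linear')(d1)
    return Model(inputs, outputs, name='generator')
```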

The discriminator takes as inputs a three-channel normalized RGB image as well as the $K$-channel output image corresponding to this input image. Its output is a class label indicating whether the $K$-channel output image is real or fake; that is, it estimates whether this output image is calculated by Eqn. 1 using the ground truth segmentation map or produced by the generator. Its network architecture is given in Fig. 4. The operations used in this network are the same as those in the generator's encoder; its convolution and pooling layers likewise use fixed filter and window sizes. Extra dropout layers, with a dropout factor of 0.2, are added. The ReLU activation function is used in all convolution layers except the last one. The last layer uses the sigmoid function since this is a binary classification problem. This network uses a convolutional PatchGAN classifier [16], which uses local patches, rather than the entire image, to determine whether the output image is real or fake.
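A compact sketch of a PatchGAN-style discriminator in the same spirit is shown below; it concatenates the RGB image with the K-channel output and produces a grid of per-patch real/fake scores. Again, the depth and filter counts are illustrative, not the exact configuration of Fig. 4.

```python
from tensorflow.keras import layers, Model

def build_discriminator(input_shape=(512, 512, 3), num_labels=5, base_filters=32):
    """Illustrative PatchGAN-style discriminator conditioned on the RGB input image."""
    image = layers.Input(shape=input_shape)
    target = layers.Input(shape=input_shape[:2] + (num_labels,))
    x = layers.Concatenate()([image, target])          # condition on the RGB image

    for mult in (1, 2, 4):                             # encoder-like downsampling blocks
        x = layers.Conv2D(base_filters * mult, 3, padding='same', activation='relu')(x)
        x = layers.Dropout(0.2)(x)
        x = layers.MaxPooling2D(2)(x)

    # Sigmoid over a grid of local patches rather than a single image-level score
    patch_scores = layers.Conv2D(1, 3, padding='same', activation='sigmoid')(x)
    return Model([image, target], patch_scores, name='discriminator')
```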

Both the generator and discriminator networks are trained from scratch using the training parameters proposed in [16], except for the batch size. The loss settings are also the same as those of [16]; the adversarial loss is defined on the outputs of the discriminator and the L1 loss is defined on those of the generator. The batch size is set to 1 so that the training and validation images fit into the GPU's memory (GeForce RTX 2080 Ti). The network weights are learned on the training images for 300 epochs. At each epoch, the loss is calculated on the validation images, and the network that gives the minimum validation loss is selected at the end.
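The training loop can be wired in the usual pix2pix fashion, as sketched below using the generator and discriminator sketches above. The optimizer settings and the L1 weight of 100 follow the defaults of [16] and are assumptions here; the exact training schedule of the iMEMS implementation may differ.

```python
import numpy as np
from tensorflow.keras import layers, Model, optimizers

generator = build_generator()          # from the generator sketch above
discriminator = build_discriminator()  # from the discriminator sketch above

# The discriminator is trained directly on real/fake pairs.
discriminator.compile(optimizer=optimizers.Adam(2e-4, beta_1=0.5),
                      loss='binary_crossentropy')

# Combined model: the discriminator is frozen while the generator is updated.
discriminator.trainable = False
rgb_in = layers.Input(shape=(512, 512, 3))
fake_embedding = generator(rgb_in)
validity = discriminator([rgb_in, fake_embedding])
combined = Model(rgb_in, [validity, fake_embedding])
combined.compile(optimizer=optimizers.Adam(2e-4, beta_1=0.5),
                 loss=['binary_crossentropy', 'mae'],  # adversarial loss + L1 loss
                 loss_weights=[1.0, 100.0])            # L1 weight of 100 as in pix2pix [16]

def train_step(rgb, embedded, patch_shape=(64, 64, 1)):
    """One update with batch size 1; rgb is (1, H, W, 3), embedded is (1, H, W, K).
    patch_shape is the spatial size of the PatchGAN output for the chosen input size."""
    real = np.ones((1,) + patch_shape)
    fake = np.zeros((1,) + patch_shape)
    generated = generator.predict(rgb, verbose=0)
    discriminator.train_on_batch([rgb, embedded], real)   # ground-truth embedding: real
    discriminator.train_on_batch([rgb, generated], fake)  # generator output: fake
    combined.train_on_batch(rgb, [real, embedded])        # generator update: fool D + L1 to target
```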

 
Fig. 5: Output maps estimated by the generator of the cGAN for the image shown in Fig. 2.

III-C Tissue Segmentation

After training its cGAN, for an unsegmented image $I$, the iMEMS method estimates its output $\hat{O}$ using the generator of the trained cGAN and segments the image based on this estimated output. In particular, it classifies each pixel $p$ with the segmentation label whose corresponding output channel has the highest estimated value; that is, $\hat{S}(p) = \arg\max_k \hat{O}_k(p)$. For the image shown in Fig. 2, the output images estimated by the cGAN are illustrated in Fig. 5.
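In code, this labeling rule is a channel-wise argmax over the generator's estimate; the sketch below assumes the generator sketch above and a normalized RGB test image rgb of shape (1, H, W, 3).

```python
import numpy as np

embedded_pred = generator.predict(rgb, verbose=0)[0]   # (H, W, K) estimated embedding
segmentation = np.argmax(embedded_pred, axis=-1)       # label = channel with the highest value
```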

IV Experiments

Fig. 6: Example images together with their annotations. In the annotations, each label is shown with a different color: normal (green), tumorous (red), connective tissue (yellow), dense lymphoid tissue (blue), and non-tissue (pink).

IV-A Dataset

We test the iMEMS method on a dataset of 365 microscopic images of hematoxylin-and-eosin stained colon tissues. The tissue samples were collected from the Pathology Department Archives of Hacettepe University, and their images were taken using a Nikon Coolscope Digital Microscope; all images were acquired with the same objective lens and at the same resolution. These images are randomly divided into training, validation, and test sets, which contain 76, 20, and 269 images, respectively.

In each image, non-overlapping regions are annotated with one of five labels: normal, tumorous (colon adenocarcinomatous), connective tissue, dense lymphoid tissue, and non-tissue (empty glass and debris). However, this annotation is not perfect and may contain inevitable inconsistencies. This is due to the fact that small subregions of different labels may be found together, because of the nature of colon tissues, and separately annotating these subregions is not always feasible (and not always that meaningful) at the selected magnification. In our dataset, there are three main factors contributing to the difficulty of annotation. Considering these factors, the images are annotated as consistently as possible, following the procedure explained below.

First, normal/tumorous regions typically contain small connective tissue and non-tissue subregions. This is inevitable since a normal/tumorous region contains colon glands, which have a luminal area (an empty-looking subregion) inside. Additionally, it contains connective tissue as the supporting material in between the glands. Thus, in the annotations, such luminal areas and connective tissues are included in the corresponding normal/tumorous region. However, if there exists a "wide" enough connective tissue region in between the glands, it is separately annotated with the connective tissue label. For example, in Figs. 6(a) and 6(b), two small exemplary connective tissue subregions are indicated with the red arrows. These subregions are included in their corresponding normal and tumorous regions since they are relatively small. On the other hand, wider connective tissues are annotated as separate regions (the yellow regions shown in the second row). Here we make every effort to be as consistent as possible in identifying the wide regions. Likewise, in Fig. 6(c), the normal region contains many small empty (non-tissue) parts, some of which are shown with the blue arrows. These small parts are included in the normal region. However, the bottom-left corner of the image is annotated as a separate region since it belongs to the empty glass rather than the tissue.

Second, due to the density heterogeneity of a colon tissue, the procedure of sectioning the paraffin-embedded tissue blocks may result in white artifacts (empty-looking subregions). Examples of such white artifacts are shown with the black arrows in Figs. 6(d) and 6(e). When such an artifact is found next to a gland, it is included in the normal/cancerous region that the gland belongs to. Otherwise, it is included in the corresponding connective tissue region. Third, lymph cells are found almost everywhere in the tissue. A group of these cells is only annotated as a separate region when they form a dense lymphoid tissue; see Fig. 6(e). Likewise, we make every effort to be consistent in identifying the dense regions.

F-scores
  Normal  Tumorous Connective  Lymphoid Non-tissue   Average  Accuracy
iMEMS 94.84 93.04 83.72 81.12 86.56 87.86 91.65
UNet-C-single 89.37 86.57 68.60 76.30 74.75 79.12 84.20
cGAN-C-single 88.31 88.82 75.87 72.87 78.80 80.93 86.00
UNet-R-single 92.12 91.07 79.66 75.41 75.38 82.73 88.87
UNet-C-multi 92.20 91.40 81.03 81.56 80.93 85.42 89.48
UNet-C-multi-int 92.96 91.45 78.24 83.41 79.10 85.03 89.32
TABLE I: Test set F-scores and accuracies of the proposed iMEMS method and the comparison algorithms.

Fig. 7: Example test images, their annotations, and the visual results obtained by the proposed iMEMS method and the comparison algorithms. Each segmentation label is shown with a different color: normal (green), tumorous (red), connective tissue (yellow), dense lymphoid tissue (blue), and non-tissue (pink). Note that the results are embedded on the original images to improve visualizability.

IV-B Results

Segmentation results are evaluated on the test images both visually and quantitatively. For quantitative evaluation, two metrics are used. The first metric is the pixel-level accuracy, which gives the percentage of correctly predicted pixels in all test images. The second one is the pixel-level F-score that is calculated for each of the five segmentation labels separately. The average of these five class-wise F-scores is also calculated.
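Both metrics can be computed directly from the flattened ground-truth and predicted label arrays accumulated over all test pixels; the sketch below uses scikit-learn for illustration and is not tied to the authors' evaluation code.

```python
from sklearn.metrics import accuracy_score, f1_score

def evaluate(y_true, y_pred, num_labels=5):
    """Pixel-level accuracy, class-wise F-scores, and their average over all test pixels."""
    acc = accuracy_score(y_true, y_pred)                   # percentage of correctly predicted pixels
    f_scores = f1_score(y_true, y_pred,
                        labels=list(range(num_labels)),
                        average=None)                      # one F-score per segmentation label
    return acc, f_scores, f_scores.mean()
```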

The quantitative results are reported in Table I. These results show that the proposed iMEMS method gives high F-scores for all segmentation labels, leading to the best accuracy and the best average F-score. It is worth noting that the dataset has an imbalanced class distribution; the dense lymphoid tissue and non-tissue labels each account for less than five percent of the training pixels. The iMEMS method yields high F-scores also for these minority classes. The visual results obtained on example test images are presented in Fig. 7. They reveal that the iMEMS method not only gives higher pixel-level performance metrics but also produces more realistic segmentations that adhere to spatial contiguity in the pixel predictions. This is attributed to the effectiveness of using the proposed embedding as the output and learning it with a cGAN. Since this output also includes the original image content, it provides regularization on the task of semantic segmentation. Additionally, since the discriminator performs real/fake classification on the entire output, it forces the generator to produce embeddings that better preserve the shapes of the segmented regions.

Method name Network Output Task
iMEMS cGAN Proposed embedding Single-task regression
UNet-C-single UNet Segmentation map Single-task classification
cGAN-C-single cGAN Segmentation map Single-task classification
UNet-R-single UNet Proposed embedding Single-task regression
UNet-C-multi UNet Segmentation map and reconstructed image Multi-task classification and image reconstruction
(reconstruction loss is calculated at the input level)
UNet-C-multi-int UNet Segmentation map and reconstructed image Multi-task classification and image reconstruction
(reconstruction loss is calculated at the input level
as well as the intermediate layers)
TABLE II: Summary of the algorithms used in the comparative study. The naming convention of these algorithms is x-y-z: x is the network type that the algorithm uses; y is R (regression) if the estimated output is the proposed embedding and C (classification) if it is the segmentation map; z indicates whether the algorithm uses a single-task or a multi-task network.

To better explore these two factors (namely, using the proposed embedding and learning it with a cGAN), we compare the iMEMS method with its counterparts. The comparison algorithms are summarized in Table II. As detailed below, these algorithms estimate either the original segmentation map or the proposed embedding, using either a UNet or a cGAN. For fair comparisons, the algorithms that use the cGAN have the same architecture as our method (Figs. 3 and 4) and those that use the UNet have the architecture of our method's generator (Fig. 3). If these algorithms estimate the proposed embedding, their last layer uses a linear function since this is a regression problem. Otherwise, if they estimate the segmentation map, their last layer uses a softmax function since this is a multi-class classification problem. The last two comparison algorithms use a multi-task network that concurrently learns the tasks of semantic segmentation and image reconstruction. These networks contain a shared encoder and two parallel decoders. Likewise, the architectures of the encoder and decoders are the same as those of the generator.

First, we compare the iMEMS method with three algorithms that consider none or only one of these two factors. The first algorithm, UNet-C-single, is the baseline that considers neither factor; it estimates the original segmentation map using a UNet. The second one, cGAN-C-single, also estimates the segmentation map, but this time with the cGAN also used by the iMEMS method. The last algorithm, UNet-R-single, estimates the proposed embedding not with this cGAN but with the same UNet as the UNet-C-single algorithm. The results of these algorithms (Table I and Fig. 7) show that the contribution of both factors is critical to obtain the best results. Furthermore, they show that the proposed embedding provides an effective regularization tool for network training regardless of the network type: UNet-R-single improves the results of UNet-C-single, and likewise, the iMEMS method improves the results of cGAN-C-single. However, using the proposed embedding together with the cGAN yields the best improvement.

Next, we compare our method with another regularization technique that simultaneously minimizes supervised and unsupervised losses during network training. This technique defines the supervised loss on the main task of image segmentation and the unsupervised loss on the auxiliary task of image reconstruction. For that, it constructs a multi-task network with one shared encoder and two parallel decoders, and it learns the network weights to minimize a joint loss function defined as a linear combination of the supervised and unsupervised losses [3, 5]. The supervised loss, $\mathcal{L}_{seg}$, is defined as the average cross-entropy on the segmentation map. For the unsupervised loss, two functions are used. The first one is the reconstruction loss, $\mathcal{L}_{rec}$, defined at the input level; this is the mean square error between the input and the reconstructed images. The second one is the sum of the reconstruction losses, $\mathcal{L}_{int}$, at the intermediate layers. For each intermediate layer, this is the mean square error between the maps produced by the corresponding encoder and decoder paths. Based on these losses, the following two comparison algorithms are implemented.

The UNet-C-multi algorithm linearly combines the supervised segmentation loss with the reconstruction loss at the input level, without considering those defined at the intermediate layers. The UNet-C-multi-int algorithm also considers the losses of the intermediate layers. These two variants are implemented because it becomes harder to select the right coefficients for each loss in the linear combination as the number of losses increases. The following experiments are conducted to better understand this phenomenon.

The UNet-C-multi algorithm defines its joint loss function as follows, where $\alpha$ and $\beta$ are the coefficients of the supervised and unsupervised losses, respectively:

$$\mathcal{L}_{joint} = \alpha\, \mathcal{L}_{seg} + \beta\, \mathcal{L}_{rec} \qquad (2)$$

To find a good combination of these coefficients, we set $\beta = 1 - \alpha$ and perform a grid search over $\alpha$ on the test images. In Fig. 8(a), the average F-score and the accuracy are plotted as a function of $\alpha$. As expected, when $\alpha$ is selected too small, the performance of the main segmentation task decreases dramatically. On the other hand, when it is selected very close to 1, the image reconstruction task cannot help improve the results. Based on this grid search, we set $\alpha$ to the value that gives the best average F-score; Table I and Fig. 7 present the test set results for this selection. These results show that a multi-task network, which regularizes its training by simultaneously minimizing the supervised and unsupervised losses, improves the results of the single-task networks. On the other hand, the proposed iMEMS method leads to better results, indicating its effectiveness as a regularization tool. The superiority of the iMEMS method might be attributed to the following: First, it unites the supervised task of segmentation and the unsupervised task of image reconstruction into a single task and trains its network by minimizing the loss defined on this united task. This united task provides a very natural loss definition, eliminating the necessity of defining a proper joint loss function with the right contributions (coefficients) of the supervised and unsupervised losses. This, in turn, leads to a more effective way of employing unsupervised learning to regularize network training. Second, the iMEMS method learns this united task by benefiting from the well-known synthesizing ability of cGANs. Thanks to using a cGAN, the iMEMS method produces realistic outputs that better comply with spatial contiguity, as also observed in Fig. 7.
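For reference, the loss combination of Eq. (2) can be realized in Keras by weighting the two decoder losses, as in the sketch below; the toy two-decoder model and the coefficient value of 0.8 are illustrative placeholders, not the network of the comparison algorithms or the value selected by the grid search. The intermediate-layer terms of Eq. (3) would be added analogously.

```python
from tensorflow.keras import layers, Model, losses

def build_multitask_net(input_shape=(512, 512, 3), num_labels=5):
    """Toy shared-encoder / two-decoder network, used only to show the loss wiring."""
    inputs = layers.Input(shape=input_shape)
    shared = layers.Conv2D(32, 3, padding='same', activation='relu')(inputs)       # shared encoder (toy)
    seg = layers.Conv2D(num_labels, 1, activation='softmax', name='seg')(shared)   # segmentation decoder
    rec = layers.Conv2D(3, 1, activation='linear', name='rec')(shared)             # reconstruction decoder
    return Model(inputs, [seg, rec])

alpha = 0.8   # illustrative coefficient only; the paper selects it by a grid search
multi_task_net = build_multitask_net()
multi_task_net.compile(
    optimizer='adam',
    loss={'seg': losses.CategoricalCrossentropy(),   # supervised loss on the segmentation output
          'rec': losses.MeanSquaredError()},         # unsupervised reconstruction loss
    loss_weights={'seg': alpha, 'rec': 1.0 - alpha}) # Eq. (2) with beta = 1 - alpha
```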

Fig. 8: (a) Accuracy and average F-score of the UNet-C-multi algorithm as a function of $\alpha$. (b) Accuracy and average F-score of the UNet-C-multi-int algorithm as a function of $\gamma$. Here $\alpha$ is fixed at the value that gives the best average F-score in the former algorithm.

The UNet-C-multi-int algorithm is the next method that also defines a linear joint loss function on the supervised and unsupervised losses. However, it additionally considers the sum of the reconstruction losses, $\mathcal{L}_{int}$, at the intermediate layers of the network. It defines the following joint loss function, which is also used in [3, 5] to regularize network training:

$$\mathcal{L}_{joint} = \alpha\, \mathcal{L}_{seg} + \beta\, \mathcal{L}_{rec} + \gamma\, \mathcal{L}_{int} \qquad (3)$$

As aforementioned, as the number of loss coefficients increases, it becomes harder to adjust the coefficients relative to each other. In our experiments, we use the best configuration of $\alpha$ and $\beta$ selected by the UNet-C-multi algorithm and determine the coefficient $\gamma$ by also performing a grid search on the test images. In Fig. 8(b), the average F-score and the accuracy are plotted as a function of $\gamma$. Table I and Fig. 7 present the test set results for the $\gamma$ value that gives the best average F-score in this grid search. Here it is observed that the inclusion of the intermediate layer losses does not further improve the results. The reason might be the following: the linear function, which is used by the UNet-C-multi-int algorithm as well as by the previous studies [3, 5], may not be the best way to combine these losses and/or it may require a more thorough coefficient search. On the contrary, the iMEMS method requires neither such an explicit joint loss function definition nor such a coefficient search, since its united task intrinsically combines these losses.

IV-C Discussion

We also analyze the segmentation errors of the iMEMS method. The most common one is segmenting connective tissue regions as tumorous. This is typically caused by incorrectly predicting tumor boundaries, especially for high-grade tumorous regions, or by not detecting connective tissues in between two regions containing tumorous glands. These errors are visualized in Figs. 9(a) and 9(b), respectively. The other common error is confusing connective and dense lymphoid tissues, as visualized in Fig. 9(c); these two labels show the most similar visual characteristics among all. Less frequently, other confusions between different labels can also be observed. Some of them are due to the inadequacy of the iMEMS method; as an example, normal regions are segmented as tumorous in Fig. 9(d). However, some of them are related to artifacts and to the difficulties encountered during annotation (see Sec. IV-A). An example is given in Fig. 9(e), where an artifact in tumorous tissue is segmented as a non-tissue region.

Fig. 9: Examples of segmentation errors of the iMEMS method. Each segmentation label is shown with a different color: normal (green), tumorous (red), connective tissue (yellow), dense lymphoid tissue (blue), and non-tissue (pink).

From the results given in Fig. 7, it is observed that especially the comparison algorithms yield many small segmented regions, which can easily be corrected by post-processing. To understand how this affects the results, the following simple post-processing is applied to the results of all algorithms: starting from the smallest one, each segmented region smaller than an area threshold is merged with its smallest adjacent region. This merging continues until no region smaller than the threshold remains. The results reported in Table III indicate that this post-processing is effective in increasing the performance. However, the increase is similar for all algorithms and does not change the conclusions drawn from the comparative study. Here a simple post-processing algorithm is used; one may design more sophisticated algorithms, also using a priori information on colon tissue characteristics. The design of such algorithms is left as future work.
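The sketch below illustrates this merging rule using scikit-image connected components, assuming a 2-D integer label map seg and an area threshold area_thr as in Table III; the smallest adjacent region is found by dilating each small region by one pixel, which is one possible realization of the rule rather than the exact implementation used in the paper.

```python
import numpy as np
from skimage import measure, morphology

def merge_small_regions(seg, area_thr):
    """Iteratively merge connected regions smaller than area_thr into their
    smallest adjacent region, starting from the smallest region."""
    seg = seg.copy()
    while True:
        # Connected components of the label map; background=-1 so every label forms regions.
        regions = measure.label(seg, background=-1, connectivity=1)
        props = sorted(measure.regionprops(regions), key=lambda r: r.area)
        small = [r for r in props if r.area < area_thr]
        if not small:
            return seg
        region = small[0]                                # start from the smallest region
        mask = regions == region.label
        ring = morphology.binary_dilation(mask) & ~mask  # one-pixel border around the region
        neighbor_ids = np.unique(regions[ring])
        if neighbor_ids.size == 0:                       # isolated region; nothing to merge into
            return seg
        sizes = {rid: int(np.sum(regions == rid)) for rid in neighbor_ids}
        target = min(sizes, key=sizes.get)               # smallest adjacent region
        seg[mask] = seg[regions == target][0]            # take over its segmentation label
```

For example, merge_small_regions(segmentation, area_thr=10000) corresponds to one of the threshold settings reported in Table III.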

Average F-scores Accuracies
0 5000 10000 25000 50000 0 5000 10000 25000 50000
iMEMS 87.86 88.31 88.61 88.68 87.89 91.65 91.86 92.05 92.28 92.04
UNet-C-single 79.12 80.22 80.41 80.69 79.15 84.20 84.72 84.93 85.15 84.72
cGAN-C-single 80.93 82.57 82.98 82.79 81.99 86.00 87.06 87.41 87.74 87.85
UNet-R-single 82.73 83.34 83.45 83.35 82.29 88.87 89.18 89.33 89.43 89.27
UNet-C-multi 85.42 86.68 86.94 86.54 85.36 89.48 90.38 90.62 90.77 90.67
UNet-C-multi-int 85.03 86.93 87.18 87.27 86.04 89.32 90.15 90.27 90.38 90.21
TABLE III: Test set average F-scores and accuracies of the algorithms after post-processing. The results are reported when the area threshold is selected as 5000, 10000, 25000, and 50000 pixels and when no post-processing is applied, i.e., when the threshold is 0.

V Conclusion

This paper proposed the iMEMS method, which employs unsupervised learning to regularize the training of a fully convolutional network for a supervised task. This method defines a new embedding to unite the main supervised task of semantic segmentation and an auxiliary unsupervised task of image reconstruction into a single task, and it learns this united task with a conditional generative adversarial network. Since the proposed embedding corresponds to a segmentation map that preserves a reconstructive ability, learning this united task forces the network to jointly learn image features and context features. This joint learning lends itself to more effective regularization, leading to better segmentation results. Additionally, this united task provides an intrinsic way of combining the segmentation and image reconstruction losses. Thus, it addresses the difficulty of defining an effective joint loss function that combines separately defined segmentation and image reconstruction losses in a balanced way. We tested this method for semantic tissue segmentation in histopathological images. Our experiments revealed that it leads to more accurate segmentation results compared to its counterparts.

The proposed method is to segment a heterogeneous tissue image into its homogeneous regions. Thus, it can be easily applied to segmenting tissue compartments in whole slide images (WSIs), as in the case of many previous studies. To do so, a WSI can be divided into image tiles, on which the method predicts the output. Alternatively, an image window can be slid on the WSI and the estimated outputs can be averaged to obtain the final segmentation. This application can be considered as one future research direction. This work used histopathological segmentation as a showcase application. Applying this method for other segmentation problems is considered as another future research direction.

References

  • [1] D. Erhan et al., “Why does unsupervised pre-training help deep learning?,” J. Mach. Learn. Res., vol. 11, pp. 625–660, 2010.
  • [2] T. L. Paine, P. Khorrami, W. Han, and T. S. Huang, “An analysis of unsupervised pre-training in light of recent advances,” arXiv preprint arXiv:1412.6597, 2014.
  • [3] J. Zhao, M. Mathieu, R. Goroshin, and Y. LeCun, “Stacked what-where auto-encoders,” arXiv preprint arXiv:1506.02351, 2015.
  • [4] A. Rasmus et al., “Semi-supervised learning with ladder networks,” in Proc. Adv. Neural Inf. Process. Syst., 2015, pp. 3546–3554.
  • [5] Y. Zhang, K. Lee, and H. Lee, “Augmenting supervised neural networks with unsupervised objectives for large-scale image classification,” in Proc. Int. Conf. Mach. Learn., 2016, pp. 612–621.
  • [6] T. Robert, N. Thome, and M. Cord, “HybridNet: Classification and reconstruction cooperation for semi-supervised learning,” in Proc. Euro. Conf. Comp. Vision, 2018, pp. 153–169.
  • [7] L. C. Chen et al., “Semantic image segmentation with deep convolutional nets and fully connected CRFs,” in Proc. Int. Conf. Learning Repr., 2015.
  • [8] E. Shelhamer, J. Long, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proc. IEEE Conf. Comp. Vis. Pattern Recognit., Jun. 2015, pp. 3431–3440.
  • [9] H. Chen et al., “DCAN: Deep contour-aware networks for object instance segmentation from histology images,” Med. Image Anal., vol. 36, pp. 135-146, Feb. 2017.
  • [10] L. Sun et al., “Joint CS-MRI reconstruction and segmentation with a unified deep network,” in Proc. Int. Conf. Inf. Process. Med. Imaging, 2019, pp. 492-504.
  • [11] H. H. Nguyen, F. Fang, J. Yamagishi, and I. Echizen, “Multi-task learning for detecting and segmenting manipulated facial images and videos,” arXiv preprint, arXiv:1906.06876., Jun. 2019.
  • [12] H. Noh, S. Hong, and B. Han, “Learning deconvolution network for semantic segmentation,” in Proc. IEEE Int. Conf. Comput. Vis., 2015, pp. 1520-1528.
  • [13] A. Arnab, S. Jayasumana, S. Zheng, P.H. Torr, “Higher order conditional random fields in deep neural networks,” in Proc. Eur. Conf. Comput. Vis., 2016, pp. 524-540.
  • [14] P. Luc, C. Couprie, S. Chintala, and J. Verbeek, “Semantic segmentation using adversarial networks,” in Proc. NIPS Workshop Adversarial Training, 2016.
  • [15] I. J. Goodfellow et al., “Generative adversarial nets,” in Proc. Adv. Neural Inf. Process. Syst., 2014, pp. 2672-2680.
  • [16] P. Isola, J. Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-image translation with conditional adversarial networks,” in Proc. Comp. Vis. Pattern Recognit., 2017, pp. 1125–1134.
  • [17] A. Makhzani et al., “Adversarial autoencoders,” arXiv preprint, arXiv:1511.05644., Nov. 2015.
  • [18] X. Zhu et al., “A novel framework for semantic segmentation with generative adversarial network,” Journ. Vis. Comm. Image Repr., vol. 58, pp. 532-543, 2019.
  • [19] J. Xu et al., “A deep convolutional neural network for segmenting and classifying epithelial and stromal regions in histopathological images,” Neurocomputing, vol. 191, pp. 214-223, 2016.
  • [20] Y. Xu et al., “Large scale tissue histopathology image classification, segmentation, and visualization via deep convolutional activation features,” BMC Bioinformatics, vol. 18, pp. 281, 2017.
  • [21] L. Chan et al., “HistoSegNet: Semantic segmentation of histological tissue type in whole slide images,” in Proc. IEEE Int. Conf. Comput. Vis., 2019, pp. 10662-10671.
  • [22] S. Takahama et al., “Multi-stage pathological image classification using semantic segmentation,” in Proc. IEEE Int. Conf. Comput. Vis., 2019, pp. 10702-10711.
  • [23] O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional networks for biomedical image segmentation,” in Proc. Med. Image Comput. Assist. Intervent., 2015, pp. 234-241.
  • [24] T. De Bel et al., “Automatic segmentation of histopathological slides of renal tissue using deep learning,” in Proc. SPIE Med. Imaging, 2018, 1058112.
  • [25] K. R. J. Oskal et al., “A U-Net based approach to epidermal tissue segmentation in whole slide histopathological images,” SN Appl. Sci., pp. 1-672, 2019.
  • [26] J. Wang, J. D. MacKenzie, R. Ramachandran, and D. Z. Chen, “A deep learning approach for semantic segmentation in histology tissue images,” in Proc. Med. Image Comput. Assist. Intervent., 2016, pp. 176-184.
  • [27] A. Phillips, I. Teo, and J. Lang, “Segmentation of prognostic tissue structures in cutaneous melanoma using whole slide images,” in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., 2019.
  • [28] Y. Xu et al., “Gland instance segmentation using deep multichannel neural networks,” IEEE Trans. Biomed. Eng., vol. 64, no. 12, pp. 2901-2912, Mar. 2017.
  • [29] S. Graham et al., “MILD-Net: Minimal information loss dilated network for gland instance segmentation in colon histology images,” Med. Image Anal., vol. 52, pp. 199-211, 2019.
  • [30] A. BenTaieb, J. Kawahara, and G. Hamarneh, “Multi-loss convolutional networks for gland analysis in microscopy,” in Proc. IEEE Int. Symp. Biomed. Imaging, 2016, pp. 642-645.
  • [31] F. Mahmood et al., “Deep adversarial training for multi-organ nuclei segmentation in histopathology images,” IEEE Trans. Med. Imaging, 2019.
  • [32] L. Mei, X. Guo, and C. Cheng, “Semantic segmentation of colon gland with conditional generative adversarial network,” in Proc. Int. Conf. Biosci. Biochem. Bioinf., 2019, pp. 12-16.
  • [33] L. Bi, D. Feng, and J. Kim, “Dual-path adversarial learning for fully convolutional network (FCN)-based medical image segmentation,” Vis. Comp., vol. 34, pp. 1043-1052, 2018.