With the rise of deep learning, semantic segmentation has made prominent progress. However, the semantic segmentation of ultra-resolution images (URIs) is seldom studied, especially in medical diagnosis. Many medical URIs [17, 2] contain more than M pixels per image, and for a whole-slide image (WSI), a special type of medical URI, the size even exceeds (about M pixels). URIs of such size require large computational resources, which some of the most popular semantic segmentation frameworks, such as UNet, PSPNet, and DeepLab [5, 6], can hardly afford.
There are two common ways to process URIs: image downsampling and sliding patches [1, 4]. The former resizes a large image to a suitable size, e.g., , then feeds it into the model, which causes a great loss of local detail, especially for WSIs. The latter crops the original image into many small patches, segments at the patch level, and finally combines the segmentation results of these patches. While these methods effectively reduce the computational burden, the global information provided by spatial context and neighborhood dependency is almost entirely abandoned, which makes it difficult to obtain accurate segmentation results.
The latest representative patch-based method is AWMF-CNN, a multi-branch structure that aggregates contextual information from patches at multiple magnifications, which contain target regions and receptive fields at different resolutions and scales. As a popular strategy, multi-branch methods find a tradeoff between small inputs and multiple scales; however, they inevitably introduce two challenging problems. Firstly, they usually need a carefully designed fusion mechanism for the final result, e.g., fusion layers consisting of many stacked convolutions with relatively many channels, or the auxiliary weighting net in , resulting in a complicated and redundant structure. Fortunately, by developing a fusion method via meta-learning, our method needs only a simple structure to ensure good results. Secondly, all the branches are independent and are trained from scratch separately, which increases the overall number of parameters significantly.
In this paper, we propose a novel multi-branch framework guided by meta-learning for ultra-resolution medical image segmentation, namely Meta Segmentation Network (MSN). Recently, meta-learning has attracted increasing attention. The negative loss gradient, which contains detailed information, e.g., the target-specific difference between prediction and label, can serve as very useful information for fast weight generation for convolutions, as has been confirmed in . Moreover, the structures of most meta-learning frameworks are quite simple but highly effective [8, 13], hence an elaborate structural design is unnecessary.
For this purpose, we develop a meta-fusion mechanism, which elegantly solves the first challenge in multi-branch methods. Specifically, we use the negative gradients of the output layers of the branches as the meta-information to train a meta-learner, which directly predicts the weights of the fusion layer. Traditional end-to-end training with backpropagation (BP), by contrast, needs more iterative steps and more training samples to converge. It is also noticeable that, unlike the elaborately designed fusion structure in AWMF-CNN, i.e., many stacked convolution layers as well as a redundant weighting net, the structure of meta-fusion contains only two convolutions with a few channels and a simple meta-learner.
To avoid learning all the branches from scratch, we further introduce an effective weight sharing mechanism. Although the inputs of the branches have different magnifications, they are still in the same domain. We therefore believe that the knowledge among these branches can be shared to some extent, i.e., through weight sharing. In our weight sharing, we adopt a special memory mechanism to achieve fast knowledge adaptation between the meta-branch and the non-meta-branches. The meta-branch is a reference branch containing the basic parameters that are shared with the other branches. Moreover, we experimentally find that direct weight sharing leads to some knowledge gaps, as illustrated in Fig. 1. To bridge these gaps, we use the memory mechanism to store some useful memory (features) from the meta-branch, and then perform a memory transformation between the meta-branch and the non-meta-branches to realize fast knowledge adaptation.
The contributions can be summarized as follows:
- A meta-fusion mechanism is proposed for a multi-branch deep model for URI segmentation by utilizing meta-learning. The weights of the fusion layer can be generated quickly by the meta-learner, leading to a simple but highly effective model.
- A novel weight sharing mechanism is introduced to realize fast knowledge adaptation, resulting in a significant reduction of the training process and the overall parameters.
- The proposed MSN achieves the best performance on two challenging datasets: BACH and ISIC. In particular, our method achieves a significant performance improvement over the latest AWMF-CNN, while its overall number of parameters is close to that of a single branch. Thus, it is a practical segmentation method in terms of both resource saving and accuracy.
2 Proposed Method
2.1 Architecture of MSN
In this section, we introduce the framework of MSN, whose architecture is illustrated in Fig. 2. It mainly contains two components: the multi-branch structure, named Mainbody, and the Meta Fusion Module (Meta-FM). Mainbody is an all-in-one structure which realizes multi-resolution segmentation. Unlike the general multi-resolution structure, which requires multiple branches with a specific resolution per branch, Mainbody uses only one branch to realize multi-resolution segmentation. We name the branches the high-resolution, middle-resolution and low-resolution branches according to the resolution of their inputs. Mainbody contains three key parts: the meta-branch, the Memory Feature Pool (Mem-FP) and the Memory Recall Module (Mem-RM). Mem-FP stores the meta-features of the meta-branch in Mainbody, while Mem-RM is deployed in the non-meta-branches to complement their distinctive features with the meta-features stored in Mem-FP, named Memory.
Mainbody outputs two preliminary segmentation maps, i.e., and in Fig. 2, which will be fused in a meta-learning way to achieve a final segmentation result. In the following, we will detail the operating mechanism of Mainbody, Mem-FP, Mem-RM, and Meta-FM.
The architecture of Mainbody is shown in the middle part of Fig. 2. Let , and , denote the three types of inputs with the same size having different resolution corresponding to the three branches (e.g., , , ). Actually, has the widest receptive field with the lowest resolution, while is the opposite. As shown in Fig. 2, is the upscaled patch centered in signed in a green box and has the high-resolution and has the middle-resolution. Mainbody can process the image patches with three resolutions and output the counterpart segmentation maps .
Considering the commonness and difference among the knowledge of the multi-resolution segmentations, we treat the low-resolution branch as the meta-branch, which shares its weights with the middle-resolution and high-resolution branches because it contains the most information, and use Mem-FP and Mem-RM to adjust the weight learning. In detail, a low-resolution image patch is fed into Mainbody and passed through the non-gap convolution layers and gap convolution layers in the meta-branch. In the gap layers, the obtained feature maps are recorded in Mem-FP. As for the high-resolution image patch , after fixing all the layers of the meta-branch, it is passed through Mainbody just like in the non-gap convolution layers; when it meets a gap convolution layer, Memory is recalled from Mem-FP, and Memory as well as the feature maps output by the current gap layer are fed into Mem-RM to adjust the weight learning. The same holds for the middle-resolution image patch . Subsequently, we fuse the two outputs , of the non-meta-branches in Meta-FM. In the following, we introduce Mem-FP, Mem-RM, and Meta-FM.
Memory Feature Pool.
Mem-FP acts as a storage pool. As shown in Fig. 1, the branches of and have a large gap with (meta-branch) in some layers of the CNN. When processing the low-resolution image patch , the feature maps output by the gap layers in the meta-branch, named Meta-features, are saved in Mem-FP and will be utilized to compensate the other branches. In fact, once is passed through the meta-branch, the obtained Meta-features are stored.
Memory Recall Module.
In order to make the meta-branch adapt to the other branches in the weight sharing mechanism, the non-meta-branches should “recall” the missing features in Mem-FP at these large-gap layers. Therefore, we construct Mem-RM and embed it in the meta-branch as an auxiliary module to recall the memory of the gap layers in the meta-branch. Specifically, as shown in Fig. 1, when an image patch is fed into a non-meta-branch, such as or , it is passed forward along the meta-branch until it meets the gap layers. At the gap layers, both the pre-saved meta-feature and the corresponding feature from the non-meta-branch are fed into Mem-RM. As shown in Fig. 3, there are two input branches: the top branch takes the Meta-feature of (A) and the bottom branch takes the output feature maps of or (B). To align the feature maps of A and B, we crop the target region centered in the Meta-feature and upscale it to the same size as B. After that, we concatenate them and apply convolutions to the result. The process is formulated as:
where is final output of Mem-RM, is the nonlinear transformation function, and , and are the operations of concatenation, upsampling and cropping, respectively.
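The crop, upsample and concatenate steps of Mem-RM described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function names, the integer `scale` between the meta-branch input and the non-meta-branch input, nearest-neighbour upsampling, and the single 1x1 convolution with ReLU standing in for the nonlinear transformation are all assumptions made for the example.

```python
import numpy as np

def center_crop(feat, scale):
    """Crop the central (H/scale, W/scale) window of a (C, H, W) feature map."""
    _, h, w = feat.shape
    ch, cw = h // scale, w // scale
    top, left = (h - ch) // 2, (w - cw) // 2
    return feat[:, top:top + ch, left:left + cw]

def upsample_nearest(feat, scale):
    """Nearest-neighbour upsampling of a (C, H, W) map by an integer factor (assumed)."""
    return feat.repeat(scale, axis=1).repeat(scale, axis=2)

def align_and_concat(meta_feat, branch_feat, scale):
    """Align the stored meta-feature (A) to the branch feature (B), then concatenate."""
    aligned = upsample_nearest(center_crop(meta_feat, scale), scale)
    return np.concatenate([aligned, branch_feat], axis=0)

def pointwise_conv_relu(x, weight):
    """Stand-in nonlinear transformation: 1x1 convolution plus ReLU.

    weight has shape (C_out, C_in); x has shape (C_in, H, W).
    """
    return np.maximum(np.einsum("oc,chw->ohw", weight, x), 0.0)

# Hypothetical shapes: 2 meta-feature channels, 3 branch channels, 8x8 maps.
meta = np.arange(2 * 8 * 8, dtype=float).reshape(2, 8, 8)
branch = np.ones((3, 8, 8))
fused = align_and_concat(meta, branch, scale=2)
recalled = pointwise_conv_relu(fused, np.ones((4, 5)))
```

With `scale=2`, the central 4x4 window of the meta-feature is enlarged back to 8x8 so that it covers the same image region as the non-meta-branch feature before the two are stacked along the channel axis.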
Meta Fusion Module.
The final step of our framework is the fusion of the different branches. Since the branches of and have already captured the memory of , we only need to consider the fusion of the branches of and . One of the most common ways is to use an elaborately designed structure, which might contain dozens of convolution layers, to perform feature fusion, and then use a common optimizer, e.g., SGD , to adjust the parameters of these convolutional layers. This process may take many iterations to converge.
To pursue a compact, simple but highly effective fusion structure, we propose using a specific target provided by an auxiliary meta-learner. It is well known that the negative gradient, which is used in SGD to determine the direction of optimization, contains detailed information that measures the difference between prediction and ground truth. Drawing on this property of the negative gradient, we construct Meta-FM to predict the weights of these convolutional layers directly. The structure of Meta-FM is shown in Fig. 2: Meta-FM receives the negative gradients (meta-information) of the output layers of the two branches and outputs the predicted weights through two fully connected (FC) layers. Meta-FM can be formulated as:
where and are the parameters of the two fusion convolutions, respectively. Note that and should be reshaped into weight matrices because the output of FC is a vector.
is a nonlinear function with the structure FC-ReLU-FC. is the gradient vector of the output layers of the two branches, which is formulated as:
is the loss function, is the segmentation prediction of the branch of , and is the ground truth. , since the resolution of is lower than , we crop the target region from and upsample it to the size of . and are the weights of the output layers of the branches of and , respectively. The operation reshapes the gradient matrix into a column vector.
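The weight-generation path of Meta-FM (flattened negative gradients, then FC-ReLU-FC, then a reshape into convolution weights) might look like the sketch below. It is an illustration under stated assumptions rather than the paper's code: the name `meta_fusion_weights`, the bias-free FC layers, a single FC stack whose output is split between the two fusion convolutions, and all tensor shapes are hypothetical.

```python
import numpy as np

def meta_fusion_weights(grad_a, grad_b, fc1, fc2, conv_shapes):
    """Predict fusion-convolution weights from output-layer gradients.

    grad_a, grad_b: gradients of the loss w.r.t. the output-layer weights
                    of the two non-meta-branches (any shape).
    fc1, fc2:       weight matrices of the FC-ReLU-FC meta-learner
                    (biases omitted for brevity, an assumption).
    conv_shapes:    target shapes of the fusion convolutions' weights.
    """
    # Negative gradients, reshaped to one vector, are the meta-information.
    g = np.concatenate([(-grad_a).ravel(), (-grad_b).ravel()])
    hidden = np.maximum(fc1 @ g, 0.0)   # FC + ReLU
    flat = fc2 @ hidden                 # FC, output is a vector
    total = sum(int(np.prod(s)) for s in conv_shapes)
    if flat.size != total:
        raise ValueError("fc2 output size must match the total number of conv weights")
    weights, start = [], 0
    for shape in conv_shapes:
        size = int(np.prod(shape))
        # The FC output is a vector, so it is reshaped into weight tensors.
        weights.append(flat[start:start + size].reshape(shape))
        start += size
    return weights

# Hypothetical sizes: two (4, 8) output-layer gradients, two 1x1 fusion convolutions.
rng = np.random.default_rng(0)
shapes = [(6, 8, 1, 1), (4, 6, 1, 1)]
fc1 = rng.normal(size=(16, 64))
fc2 = rng.normal(size=(72, 16))
w1, w2 = meta_fusion_weights(rng.normal(size=(4, 8)), rng.normal(size=(4, 8)), fc1, fc2, shapes)
```

Because reshaping and slicing are differentiable, a loss computed on the fused output can be propagated back through the generated weights to `fc1` and `fc2` when the same computation is written in an automatic-differentiation framework.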
2.2 Loss Function
We use the cross entropy as the loss function of our model, which can be formulated as:
where is the predicted segmentation maps, denotes the counterpart ground truths, is the total number of samples, and is the -th pixel of . This loss function is applied to the multiple segmentation results of our model, i.e., as well as the final fused result , to train our model.
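As a small worked example of a pixel-wise cross-entropy, the sketch below computes the loss for a flattened prediction. The function name, the list-based data layout, the averaging over pixels, and the `eps` clamp are assumptions for illustration; the exact normalisation used by the authors is not recoverable from the text here.

```python
import math

def pixel_cross_entropy(probs, labels, eps=1e-12):
    """Mean cross-entropy over pixels.

    probs:  list of per-pixel class-probability lists (each sums to 1).
    labels: list of ground-truth class indices, one per pixel.
    eps:    lower clamp that avoids log(0) (an implementation assumption).
    """
    if len(probs) != len(labels) or not labels:
        raise ValueError("probs and labels must be non-empty and equally long")
    total = 0.0
    for p, y in zip(probs, labels):
        total -= math.log(max(p[y], eps))
    return total / len(labels)

# Two pixels, two classes: the first is predicted perfectly, the second is a coin flip.
loss = pixel_cross_entropy([[1.0, 0.0], [0.5, 0.5]], [0, 1])
```

In this example the first pixel contributes zero loss and the second contributes ln 2, so the mean is ln 2 / 2.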
We adopt a three-step training scheme to train MSN. Step 1: we train the meta-branch in Mainbody to obtain the meta parameters that will be shared with the other branches. Step 2: Mem-RM is trained for the non-meta-branches to fix the knowledge gaps. Step 3: Meta-FM is learned to fuse the multi-resolution segmentation results. We divide the training data into two parts: a training set and a sub-training set. The sub-training set is much smaller than the training set. The training set is used for the first step, while the sub-training set is involved in the second and third steps. We initialize all layers similar to .
In Step 1, the low-resolution image patch is fed into the branch to obtain the segmentation map , and then the weights of this branch are updated with the loss function formulated in Eq. (4).
After Step 1, we have the meta parameters and fix them. Next, we train Mem-RM on the sub-training set to alleviate the influence of the gap layers w.r.t. the meta-branch. We input or to the fixed meta-branch with their specific Mem-RMs and get the corresponding segmentation results and . Then we use the loss functions and to update each Mem-RM, which is specific to the branch of or .
Firstly, we fix the trained meta-branch and Mem-RM. Then we obtain the segmentation maps , in the same way as in Step 2. After the operations mentioned before, e.g., cropping and concatenation, we feed the processed and to the fusion layers, whose weights are generated by Meta-FM, and obtain the fusion result . Since the reshape operation on the weight vector before it is loaded into the fusion layer is differentiable, we tune the parameters of Meta-FM in a few epochs by minimizing the loss function on the sub-training set.
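The staged schedule described above (which component is updated in each step, which ones stay fixed, and which data split is used) can be summarised as a small configuration sketch. The module keys and the `frozen_modules` helper are hypothetical names introduced only for this illustration.

```python
# Hypothetical module names for the three trainable parts of MSN.
MODULES = ("meta_branch", "mem_rm", "meta_fm")

# Step -> component being trained and data split used, as described in the text.
SCHEDULE = {
    1: {"train": {"meta_branch"}, "data": "training set"},
    2: {"train": {"mem_rm"}, "data": "sub-training set"},
    3: {"train": {"meta_fm"}, "data": "sub-training set"},
}

def frozen_modules(step):
    """Modules whose parameters are not updated in the given step."""
    return set(MODULES) - SCHEDULE[step]["train"]

for step in sorted(SCHEDULE):
    cfg = SCHEDULE[step]
    print(step, sorted(cfg["train"]), sorted(frozen_modules(step)), cfg["data"])
```

In a PyTorch implementation, the frozen set would typically be realised by disabling gradients for the corresponding parameters and passing only the trainable ones to the optimizer.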
In this section, we evaluate our method on two ultra-resolution medical datasets: BACH and ISIC. We take two criteria for evaluation: the mean Intersection over Union (mIoU) and the amount of model parameters.
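For reference, the mIoU criterion can be computed as in the following sketch over flattened label maps. The function name is illustrative, and skipping classes that appear in neither the prediction nor the ground truth is an assumption; evaluation protocols differ on this point.

```python
def mean_iou(pred, target, num_classes):
    """Mean Intersection over Union for flattened integer label maps."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # ignore classes absent from both maps (assumption)
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0

# Class 0: IoU = 1/2, class 1: IoU = 2/3, so the mean is 7/12.
score = mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
```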
BACH  is composed of Hematoxylin and Eosin (HE) stained breast histology microscopy and whole-slide images (WSI). There are WSIs, with an average size of pixels (about M pixels), included in BACH. These WSIs are stored in a multi-resolution pyramid structure, i.e., , and . Four classes are presented in BACH: normal, benign, in situ, and invasive carcinoma. We randomly split WSIs into , , images for the training set, the sub-training set and the test set, respectively.
ISIC [17, 7] is an ultra-resolution medical dataset for pigmented skin lesions, which contains images in total. Its average resolution is up to M, while the highest resolution is up to . The dense annotations contain two classes: lesion and normal. We randomly divide the dataset into training, sub-training and testing sets with , and images.
3.2 Implementation Details
In our model, we use BiSeNet as the backbone, i.e., the CNN structure in Mainbody, which contains three branches: the high-resolution branch, the middle-resolution branch and the low-resolution branch. We feed patches of size into MSN. We first crop out from left to right in the image without overlapping, except for the last patch in each row. Then we align the center of the target area to crop out and ; if the cropped patch exceeds the boundary, padding is applied. For BACH, we use the professional tool “OpenSlide” to read the multi-resolution pyramid in each WSI, where the resolutions of the input patches fed into the three branches are , , and , respectively. We finally obtain training, sub-training and test sets with , and patches for each resolution. For ISIC, we set the three resolutions as , and , where the original image is considered the highest resolution. Then we crop out , and in these three-resolution images, respectively. The numbers of patches in each resolution for the training, sub-training and test sets are , and , respectively. We train the meta-branch for epochs, and tune the non-meta-branches as well as Meta-FM for only epochs, with a batch size of . The Adam optimizer is used with an initial learning rate to update the parameters of the network. The whole model is trained in PyTorch on a single Ti GPU.
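The cropping procedure described above can be sketched in terms of patch coordinates. This is an illustrative outline only: `patch_origins` and `context_window` are hypothetical helpers, the same tiling rule is assumed for rows and columns, and the `factor` relating a lower-resolution view to the high-resolution patch is a free parameter because the actual sizes are not given here.

```python
def patch_origins(length, patch):
    """Start offsets of patches along one axis.

    Patches do not overlap, except that the last one is shifted back so it
    ends exactly at the image border when `length` is not a multiple of `patch`.
    """
    if length <= patch:
        return [0]
    starts = list(range(0, length - patch + 1, patch))
    if starts[-1] + patch < length:
        starts.append(length - patch)
    return starts

def context_window(x0, y0, patch, factor):
    """Window of side patch * factor sharing its centre with the patch at (x0, y0).

    Returns (left, top, width, height). Coordinates may fall outside the
    image, in which case the crop has to be padded before it is resized
    to the common patch size.
    """
    side = patch * factor
    cx, cy = x0 + patch // 2, y0 + patch // 2
    return (cx - side // 2, cy - side // 2, side, side)

# A 10-pixel-wide row tiled with 4-pixel patches: the last patch overlaps the previous one.
row = patch_origins(10, 4)
# The corner patch needs a context window that starts outside the image.
win = context_window(0, 0, patch=4, factor=2)
```

Here `row` is `[0, 4, 6]` and `win` is `(-2, -2, 8, 8)`, which shows both the overlapping last patch and a context crop that requires padding.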
|Method||mIoU (%)||Params (M)|
|AWMF-CNN (fixed) (2019)||28.6||31.3||42.5||42.9||76.3|
|AWMF-CNN (fixed) (2019)||25.2||31.7||37.9||38.7||61.2|
3.3 Comparisons with State-of-the-art Methods
Result on BACH Dataset.
We compare our method with five state-of-the-art methods: UNet, PSPNet, BiSeNet, DeepLab-V3+ and AWMF-CNN, where the first four are representative general semantic segmentation frameworks and the last one is the latest powerful multi-branch method for processing medical URIs. Because the first four methods are not multi-branch structures, fusion results are not available for them, which is denoted by “-” in Table 1. All methods have publicly available code except AWMF-CNN, so we reproduce it using PyTorch. Moreover, since AWMF-CNN uses UNet as its backbone in , we also implement our method with the same backbone without loss of generality. We train all competitors using Eq. (4) on the training set. For the first four methods, we train each model with a specific resolution for epochs. For AWMF-CNN, we adopt two training schemes: 1) Similar to the original scheme of AWMF-CNN, we first pretrain its three branches for epochs, then train the fusion parts. After that, we alternately train the multi-resolution branches and the fusion part for 20 epochs. 2) We train only its fusion part for epochs with the trained branches fixed; we denote this as AWMF-CNN (fixed). For convenience, we use different superscript marks to denote the different settings: “”: BiSeNet as backbone; “”: UNet as backbone; “”: similar to the other comparison methods, MSN is trained on the training set. For all methods, we report the best results on the test set.
As shown in Table 1, our method achieves the best results. Note that, with the help of our special weight sharing mechanism, we improve the results of the non-meta-branches significantly, by almost mIoU compared with the counterpart branches of BiSeNet and UNet, respectively (for example, the branch of of MSN achieves mIoU, while that of BiSeNet only gets ). Meanwhile, the result can be further boosted with our meta-fusion.
It can also be found that our method already obtains the best results when trained only on the small sub-training set, e.g., MSN, while the other comparison methods are trained on the training set. When we also train on the training set, e.g., MSN, we get better performance. Therefore, it can be concluded that MSN is more flexible in its data requirements.
Another important point is that the number of parameters of MSN is almost the same as that of a single network (see MSN vs BiSeNet and MSN vs UNet), and is much smaller than that of AWMF-CNN; thus our model has lower complexity and is more practical.
|Method||mIoU (%)||Params (M)|
|AWMF-CNN (fixed) (2019)||45.1||46.4||46.1||48.9||76.3|
Result on ISIC Dataset.
The comparison results on ISIC are shown in Table 2. For fast implementation without loss of generality, we compare MSN with the latest three methods: DeepLab-V3+, BiSeNet and AWMF-CNN. MSN again achieves the best result with a number of parameters comparable to that of the other methods.
Finally, we visualize the results on BACH and ISIC. Due to space limitation, we directly compare our method (BiSeNet as backbone) with BiSeNet that trains three branches separately. The results are illustrated in Fig. 4. Obviously, with the special weight sharing mechanism, the non-meta-branches of MSN significantly outperform all the branches of BiSeNet. More importantly, with our meta-fusion mechanism, some details can be further refined, which makes the final result more complete.
3.4 Ablation Study
Effectiveness of Weight Sharing Mechanism.
The special weight sharing mechanism can not only significantly reduce the amount of parameters of multi-branch model, but also realize the knowledge transfer between the branches. Thus the results of the non-meta-branches can be promoted on the basis of the meta-branch, which has been verified in Section 3.3. To further verify its effectiveness, we design the following experiments:
Firstly, we compare four methods: (1) Meta-branch: we use only the trained meta-branch to obtain the results for all resolution inputs, without fixing the gap layers. (2) Multi-branch: all branches in this structure are trained separately from scratch. (3) MSN and (4) MSN. The backbone of all methods is BiSeNet. For a fair comparison, we also apply our meta-fusion mechanism to the first two compared methods. The results are shown in Table 4. Our non-meta-branches outperform the other methods significantly, which shows that our weight sharing mechanism effectively eliminates the gaps between the meta-branch and the other branches and then improves the performance by leveraging existing knowledge. Since the fusion results are based on the branch results, improving the branches also improves the final performance.
Secondly, we compare the convergence of the non-meta-branches of MSN with that of Multi-branch (trained on the training set). The curves are illustrated in Fig. 5. They show that our method not only segments better but also converges faster. When we use the training set rather than the sub-training set for training (MSN), we obtain better convergence.
|on non-gap layers||33.7||30.3||42.5||34.5|
Thirdly, to explore the impact of the ‘gap layers’, we attempt to add Mem-RM only at the ‘non-gap layers’, i.e., the layers whose indices are in black font in Fig. 1. As expected, the result of this approach drops considerably, which shows that our effort to fix the gaps between the meta-branch and the other branches is reasonable.
|Dataset||Method||mIoU (%)||# F (M)|
Effectiveness of Meta-fusion.
To verify the effectiveness of the meta-fusion mechanism, we compare three methods: (1) w/o Meta: the same fusion structure as ours without meta-fusion mechanism, i.e., two convolution layers in Fig. 2, and we train it end-to-end from scratch. (2) AWMF-CNN: the fusion mechanism in AWMF-CNN, which introduces a heavy weighting net for branches weighting, then uses some convolution layers training from scratch for fusion. (3) MSN: our meta-fusion mechanism. For fair comparison, we fix the results of three resolutions, which come from our trained three branches whose backbones are BiSeNet, and then we train the fusion part of all comparison methods on the sub-training set.
The comparison results on BACH and ISIC are shown in Table 5. We can observe that our method outperforms w/o Meta significantly. Although our method has more parameters, the difference is almost negligible due to its small order of magnitude. The performance of AWMF-CNN's fusion is a little lower than ours, but its fusion structure is more complicated, resulting in a sharp increase in the number of parameters.
We further illustrate the training trends of the meta-fusion mechanism and w/o Meta on BACH and ISIC in Fig. 6. As expected, our method converges much faster, e.g., it converges in almost epoch.
In this work, we propose MSN for the effective segmentation of medical URIs. A novel meta-fusion module with a very simple but effective structure is introduced to fuse the branches via meta-learning. Moreover, MSN achieves a lightweight multi-branch structure with the help of our particular weight sharing mechanism. The experimental results on BACH and ISIC demonstrate that our method achieves the best overall performance.
- [1] (2009) Color graphs for automated cancer diagnosis and grading. IEEE Transactions on Biomedical Engineering 57 (3), pp. 665–674.
- [2] (2019) Bach: grand challenge on breast cancer histology images. Medical image analysis.
- [3] (2012) Stochastic gradient descent tricks. In Neural networks: Tricks of the trade, pp. 421–436.
- [4] Stacked predictive sparse decomposition for classification of histology sections. International journal of computer vision 113 (1), pp. 3–18.
- [5] (2014) Semantic image segmentation with deep convolutional nets and fully connected crfs. arXiv preprint arXiv:1412.7062.
- [6] (2018) Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European conference on computer vision (ECCV), pp. 801–818.
- [7] (2018) Skin lesion analysis toward melanoma detection: a challenge at the 2017 international symposium on biomedical imaging (isbi), hosted by the international skin imaging collaboration (isic). In 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018), pp. 168–172.
- [8] Model-agnostic meta-learning for fast adaptation of deep networks. Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1126–1135.
- [9] (2013) OpenSlide: a vendor-neutral software foundation for digital pathology. Journal of pathology informatics 4.
- [10] Delving deep into rectifiers: surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, pp. 1026–1034.
- [11] (2017) Introduction to pytorch. In Deep learning with python, pp. 195–208.
- [12] (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- [13] (2019-10) MetaPruning: meta learning for automatic neural network channel pruning. In The IEEE International Conference on Computer Vision (ICCV).
- [14] (2017) Meta networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 2554–2563.
- [15] (2015) U-net: convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pp. 234–241.
- [16] Adaptive weighting multi-field-of-view cnn for semantic segmentation in pathology. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 12597–12606.
- [17] (2018) The ham10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions. Scientific data 5, pp. 180161.
- [18] (2018) Bisenet: bilateral segmentation network for real-time semantic segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 325–341.
- [19] (2017) Pyramid scene parsing network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2881–2890.