Today, ultra-high resolution images are widely accessible to the computer vision community, and there is therefore an increasing demand for effective analysis. Such images are common in an array of scientific tasks, such as urban scene analysis cordts2016cityscapes , geospatial analysis maggiori2017can and histopathology, where images can reach sizes of billions of pixels srinidhi2019deep . In particular, semantic segmentation, the problem of assigning each pixel to one of several semantic categories, forms an integral component of such analysis as it allows for a granular understanding of the scenes. However, manual analysis of ultra-high resolution images is extremely time-consuming and prone to false negative predictions due to the enormous image size, motivating the development of automated methods.
Deep Learning (DL) based methods have recently been adopted to improve the segmentation of high-resolution images. However, due to the high GPU memory requirements, modern DL models of sufficient complexity cannot operate on whole ultra-high resolution images in many practical settings. To mitigate this issue, the images are typically dissected into smaller patches and/or down-sampled to fit into the available GPU memory srinidhi2019deep ; zhao2017pyramid ; zhao2017self . Exploiting all the available GPU memory thus requires one to trade off the field of view (FoV), i.e., the spatial extent of context, against the spatial resolution, i.e., the level of image detail. Tuning this trade-off exhaustively is expensive seth2019automated , and as a result it is commonly set based on crude developer intuition. Moreover, as seth2019automated points out, the optimal trade-off is application or even image dependent (e.g., some parts of the images may require more context than local detail), and thus the existence of a “one-size-fits-all” FoV/resolution trade-off for a given application is highly questionable.
A considerable amount of work has attempted to alleviate the issue of subjectivity in tuning this trade-off by learning to merge multi-scale information, in both computer vision chen2016attention ; chen2019collaborative and medical imaging kamnitsas2017efficient . Specifically, these approaches learn representations from multiple parallel networks and then aggregate information across different scales before making the final prediction. DeepMedic kamnitsas2017efficient is one pioneering example in this category: it consists of two parallel networks processing input patches of different scales, and addresses the task of lesion segmentation in voluminous multi-modal 3D MRI data. Another example is li2019classification , which is designed to work specifically with mega-pixel histopathology images. In computer vision, the authors of chen2019collaborative proposed a method for integrating multi-scale information by enforcing global and local streams to exchange information with each other, and attained state-of-the-art performance on the segmentation of ultra-high resolution satellite images. Another notable approach chen2016attention uses the attention mechanism to softly weight the multi-scale features at each pixel location, and demonstrates impressive performance on multiple challenging segmentation benchmarks. However, these approaches share some limitations: 1) the presence of multiple parallel networks, which can be computationally expensive; 2) the manual selection of a limited number (2 or 3) of scales; 3) the reliance on specific choices of neural network architecture.
In this work, we first demonstrate empirically, on three public segmentation datasets of ultra-high resolution images, that the choice of the input patch configuration (i.e., the FoV/resolution trade-off) indeed considerably influences the segmentation performance. Secondly, motivated by this finding, we propose the foveation module, a data-driven “data loader” that learns to provide the segmentation network with the most informative patch configuration for each location in an ultra-high resolution image. Specifically, the foveation module processes a downsampled version of the given ultra-high resolution image and estimates distributions over multiple patches of different resolution/FoV trade-offs at the respective locations. This hierarchical approach to segmentation is inspired by the way in which human annotators provide segmentation labels for an extremely large image — they saccade their gaze through the whole image and zoom in at different locations to the appropriate extent in order to acquire the required local and contextual information.
The foveation module can be trained jointly, in an end-to-end fashion, with the segmentation network to optimise the task performance. Additionally, our foveation module, as a learnable/adaptive data loader, can be used to augment a wide range of existing segmentation architectures (as we show empirically). We demonstrate the general utility of our approach on three public datasets from different domains (Cityscape, DeepGlobe and the Gleason2019 challenge dataset), where we show that a light-weight implementation of the foveation module can boost segmentation performance with little extra computational cost.
2 Related Work
2.1 Multiscale Architectures
Multi-scale approaches chen2014semantic ; chen2016attention ; hariharan2015hypercolumns have achieved good performance on segmentation tasks by aggregating either features at different layers or features of multiple inputs with different resolutions. Such an aggregation process is performed in either a sequential or a parallel fashion. A typical sequential approach such as RefineNet lin2017refinenet fuses features from multiple inputs of different resolutions progressively, starting from the lowest. Feature fusion can also be done in parallel. For example, U-net ronneberger2015u directly concatenates the low-level features from the encoder to the high-level features in the decoder via skip connections. Contracting-expanding variants of U-net are very popular noh2015learning ; badrinarayanan2017segnet ; zhao2017pyramid . A similar approach, the Feature Pyramid Network (FPN) lin2017feature , fuses features in a hierarchical fashion as in U-net, but makes predictions at the respective feature levels and aggregates them into the final output.
Recently, the high-resolution network (HRNet) sun2019high was proposed to combine the merits of sequential and parallel approaches, and achieved state-of-the-art performance on the CityScapes segmentation dataset cordts2016cityscapes . HRNet merges features at different scales in parallel, and then repeats this merge sequentially from the higher resolution levels. Another notable approach, specially designed to segment ultra-high resolution images, is the GlobalLocalNet (GLNet) chen2019collaborative , in which the low-level features from the massively downsampled input image and the high-level features from local patches of the original resolution are aggregated in a spatially consistent fashion.
While the above approaches have demonstrated considerable performance, they still resort to specific designs of network architectures. In contrast, our approach can be used with a wide range of networks — we will later demonstrate the benefits of augmenting UPerNet xiao2018unified (an extension of FPN lin2017feature ) and HRNet sun2019high . Furthermore, many of the above multi-scale methods use a manually tuned combination of the crop size (i.e., FoV) and resolution of the input patches, or use a set of multiple randomly chosen such combinations zhao2017pyramid ; zhao2017self ; sun2019high to ensure good performance. Here we provide a systematic analysis to confirm the considerable influence of these two factors on the segmentation performance, and present a mechanism (the foveation module) to infer the optimal patch configurations at different image locations.
2.2 Hierarchical vision systems
Region-proposal approaches for object detection tasks have shown that optimal region crops vary spatially depending on local patterns. A typical region-proposal framework consists of a hierarchy of tasks: 1) proposing region crops, and 2) the downstream classification girshick2014rich ; girshick2015fast or segmentation he2017mask of the cropped region. Our approach shares a similar structure, but differs in that the first task proposes a set of desirable input patch configurations (FoV/resolution trade-offs) at different image locations to maximise downstream performance. Also, our approach is end-to-end, while typical region-proposal approaches train the two hierarchical tasks separately for computational efficiency. The approach closest to ours is “Learning to Zoom” recasens2018learning , which introduces a learnable module that changes the resolution of the input images in a spatially varying way to emphasise the salient parts of the data. However, that approach assumes the task network still operates on the whole downsampled image, limiting its applicability to ultra-high resolution images.
2.3 Processing ultra-high resolution images in computer vision
Most CNN-based approaches do not process images beyond a certain size in 2D or 3D girdhar2016learning ; choy20163d . Non-uniform grid representations are becoming a common approach in computer vision to alleviate the memory cost by modifying input images in a non-uniform fashion to improve efficiency; examples include meshes wang2018pixel2mesh ; gkioxari2019mesh , signed distance functions mescheder2019occupancy , and octrees tatarchenko2017octree . Recently, Marin et al. marin2019efficient proposed an efficient semantic segmentation network based on non-uniform subsampling of the input image prior to processing with a segmentation network. However, these lines of work do not consider optimising such representations to maximise the downstream task of interest. A related but alternative strategy is proposed in katharopoulos2019processing , in which attention weights are learned over the input image to sample a small informative subset for the downstream classification task. A similar approach was introduced in shen2019globally , but based on the use of saliency maps, for the task of breast cancer screening with high-resolution mammography images. However, these works only address the problem of classification and are not directly applicable to the segmentation task, which requires all locations to be examined.
3 Method

In this section, we first perform a motivating experiment to illustrate the impact of the patch FoV/resolution trade-off on the segmentation performance and its spatial variation across the image. Motivated by this finding, we then propose the foveation module, a module that learns to provide the segmentation network with the most informative patch configuration for each location in an ultra-high-resolution image.
3.1 Patch Configuration Matters in Segmentation
The first part of our work performs an empirical analysis to investigate a key question: “How does the FoV/resolution of training input patches affect the final segmentation performance?”. To this end, we use the following three ultra-high-resolution segmentation datasets from different domains.
| Dataset | Content | Resolution (pixels) | Number of Classes |
| --- | --- | --- | --- |
| Cityscape cordts2016cityscapes | urban scenes | – | 19 |
| DeepGlobe DeepGlobe18 | aerial scenes | – | 6 |
In the interest of space, we only show results on the Gleason2019 histopathology dataset (see the supplementary material for similar results on DeepGlobe DeepGlobe18 and Cityscape cordts2016cityscapes ). We first train a set of different segmentation networks, each with a different combination of FoV and downsampling rate (i.e., resolution); see the small blue dots on the curve in Fig. 2(a). We note that the maximum tensor size of the input patch is capped at a fixed value and is constant along the curve in Fig. 2(a). The segmentation network with the best performance for each class is highlighted with a different marker shape (see Fig. 2(a)). It is clear that there is no “one-size-fits-all” patch configuration of the training data that leads to the best performance overall and for all individual classes. Fig. 2(b) illustrates the variation of the segmentation performance visually. In addition, even within each class, we find that the optimal patch configuration can vary between different spatial locations, as shown in Fig. 2(c). These observations together imply that the standard patch sampling scheme with a pre-set FoV/resolution trade-off is sub-optimal, highlighting the potential benefits of a more intelligent strategy that can adaptively select the patch with the most informative configuration to describe the local patterns at a given location.
3.2 Foveation Module for Adaptive Patch Configuration
Motivated by the above finding, we introduce the foveation module, a data-driven patch sampling strategy which selects, at each spatial location in an ultra-high resolution image, an appropriate configuration (i.e., resolution and FoV) of the local patch that feeds into the segmentation network. The inspiration for our method stems from the way in which human experts segment high-resolution images — starting from a low-resolution bird’s-eye view of the whole image (screen displays and human vision typically have lower resolutions than the ultra-high resolution images of interest in this work), the annotators navigate their gaze through different locations and zoom in to the right extent to collect both local and contextual information. The magnification scale is controlled by what is called foveation (i.e., the process of adjusting the focal length of the eye, the distance between the lens and the fovea). The proposed adaptive patch-selection scheme is akin to this process, hence the name, foveation module. The foveation module can be seen as a learnable “dataloader” that is optimised to maximise the performance of a given segmentation network.
Fig. 3 provides a schematic of the proposed method. The foveation module takes a low-resolution version of a mega-pixel input image and generates importance weights over a set of patches with varying spatial FoV/resolution at different pixel locations. Then, the segmentation network processes the input patches based on the outputs of the foveation module, and estimates the corresponding segmentation probabilities. Note that the only requirement on the segmentation network is that it operates on a single input patch/image; we will later demonstrate this by augmenting two recent and different architectures, namely UPerNet xiao2018unified (Pyramid Pooling + FPN head) and HRNet sun2019high .
More specifically, for each mega-pixel image $I \in \mathbb{R}^{H \times W \times C}$, where $H$, $W$, $C$ denote the height, width and number of channels respectively, we compute its lower-resolution version $I_{\downarrow}$. We also define a “patch-extractor” function that, at each pixel in $I_{\downarrow}$, extracts a set of $D$ patches of varying field-of-view/resolution (but the same number of pixels) from the full-resolution image $I$, centred at the corresponding location (see Fig. 3 for a set of examples). The foveation module, $F_{\theta}$, parametrised by $\theta$, takes the low-resolution image $I_{\downarrow}$ as input and generates a probability distribution over the $D$ candidate patches at each spatial location in $I_{\downarrow}$. In other words, the values of $F_{\theta}(I_{\downarrow})$ at pixel $(i,j)$ define a $D$-dimensional probability vector $p_{ij}$ over the extracted patches. We then select the input patch $x_{ij}$ by sampling from this patch distribution:

$x_{ij} \sim \mathrm{Categorical}(p_{ij})$ (1)

and feed it to the segmentation network, $S_{\phi}$, parametrised by $\phi$, to estimate the corresponding segmentation probabilities. We note here that the spatial extent of the predicted segmentation corresponds to the area covered by the input patch with the smallest field of view in the set.
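As an illustration, the patch extraction and the sampling step of eq. (1) can be sketched as follows. This is a minimal numpy sketch, not the paper's implementation: the function names, the nearest-neighbour subsampling, and the boundary clipping are our assumptions.

```python
import numpy as np

def extract_patch(image, cy, cx, fov, out_size):
    """Crop a (fov x fov) window centred at (cy, cx), clipped to the image
    bounds, then subsample it to (out_size x out_size) by nearest-neighbour
    striding so every patch has the same pixel count regardless of its FoV.
    Assumes fov is a multiple of out_size."""
    h, w = image.shape[:2]
    half = fov // 2
    y0 = int(np.clip(cy - half, 0, h - fov))
    x0 = int(np.clip(cx - half, 0, w - fov))
    crop = image[y0:y0 + fov, x0:x0 + fov]
    step = fov // out_size
    return crop[::step, ::step][:out_size, :out_size]

def sample_patch(image, cy, cx, fovs, out_size, probs, rng):
    """Sample one patch configuration from the categorical distribution
    predicted by the foveation module at this location (eq. 1)."""
    d = rng.choice(len(fovs), p=probs)
    return extract_patch(image, cy, cx, fovs[d], out_size), d
```

All candidate patches share the same tensor size (here `out_size` squared), so the downstream segmentation network sees a fixed input shape whichever configuration is drawn.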
During training, we would like to encourage the foveation module to output meaningful probabilities over patches at different spatial locations. In particular, the patch probabilities should reflect the relative “informativeness” of different patch configurations at each location for the downstream segmentation task. To this end, we jointly optimise the parameters of both the foveation module and the segmentation network to minimise a segmentation-specific loss function (e.g., cross-entropy + L2 weight decay). However, due to the non-differentiable nature of patch sampling from discrete distributions, one cannot naively apply stochastic gradient descent. We devise and evaluate several solutions, as detailed in Sec. 3.3. We also note that, for computational efficiency, for each mega-pixel image we randomly select a subset of pixels from its low-resolution counterpart, compute the corresponding input patches, feed them to the segmentation network, and compute the losses only at those locations.
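The random subsampling of training locations mentioned above could look like the following sketch; the function name and the uniform, without-replacement sampling scheme are our assumptions.

```python
import numpy as np

def sample_locations(h, w, n, rng):
    """Pick n distinct pixel locations of the (h x w) low-resolution image
    at which to extract patches and evaluate the segmentation loss."""
    idx = rng.choice(h * w, size=n, replace=False)
    rows, cols = np.unravel_index(idx, (h, w))
    return np.stack([rows, cols], axis=1)  # shape (n, 2)
```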
At inference time, we segment the whole mega-pixel image by aggregating the predictions at different locations. To speed up inference, we can also optionally downsample further to reduce the number of locations over which patches are sampled, as explained in detail in Sec. 4.
3.3 Learning the Spatial Distribution of Patch Configurations
The sampling of input patches from discrete distributions in eq. (1) creates discontinuities, giving the objective function zero gradient with respect to the parameters of the foveation module. We devise the following approximations to address this problem.
Gradient estimation with Gumbel-Softmax:
In this approach, we approximate each discrete distribution by the so-called Concrete maddison2016concrete or Gumbel-Softmax (GSM) distribution jang2016categorical . GSM is a continuous relaxation which allows for patch sampling that is differentiable with respect to the parameters of the foveation module through a reparametrisation trick. A temperature term adjusts the bias-variance trade-off of the gradient approximation; as the temperature approaches 0, samples from the GSM distribution become one-hot (i.e., lower bias) while the variance of the gradients increases. In practice, we start at a high temperature and anneal it to a small but non-zero value, as in jang2016categorical ; gal2017concrete ; bragman2019stochastic and as detailed in the supplementary materials.
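The annealing schedule can be as simple as an exponential decay with a non-zero floor; the sketch below is illustrative and the constants are our assumptions, not the values used in the paper.

```python
import math

def anneal_tau(step, tau0=5.0, tau_min=0.5, rate=1e-4):
    """Exponentially decay the Gumbel-Softmax temperature from tau0
    towards a small non-zero floor tau_min over training steps."""
    return max(tau_min, tau0 * math.exp(-rate * step))
```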
We have also experimented with another popular gradient estimator, REINFORCE williams1992simple . Early experiments suggested that the optimisation was challenging, likely due to the well-known high variance of this estimator. One could also use more sophisticated unbiased estimators with lower variance, such as REBAR tucker2017rebar and RELAX grathwohl2017backpropagation , but the GSM-based approximation (although only unbiased in the limit of zero temperature) worked well for the initial experiments in this paper.
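For reference, a Gumbel-Softmax sample over the patch probabilities can be drawn as in the numpy sketch below; in PyTorch, `torch.nn.functional.gumbel_softmax` provides this directly (with optional straight-through hardening).

```python
import numpy as np

def gumbel_softmax(logits, tau, rng):
    """Relaxed one-hot sample from the Concrete/Gumbel-Softmax
    distribution. Low tau -> near one-hot samples (lower bias, higher
    gradient variance); high tau -> near-uniform samples."""
    gumbel = -np.log(-np.log(rng.uniform(size=logits.shape)))  # Gumbel(0,1)
    y = (logits + gumbel) / tau
    y = y - y.max()                 # subtract max for numerical stability
    expy = np.exp(y)
    return expy / expy.sum()
```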
Mean approximation: Instead of sampling a patch at each pixel location, we compute the average input patch weighted by the estimated probabilities and feed it to the segmentation network:

$\bar{x}_{ij} = \sum_{d=1}^{D} p_{ij}^{(d)} \, x_{ij}^{(d)}$

Here $p_{ij}^{(d)}$ denotes the probability that the foveation module assigns to the $d$-th candidate patch $x_{ij}^{(d)}$ at pixel $(i,j)$, and quantifies the “importance” of that patch at the location. With this approach, the objective function becomes fully differentiable with respect to the parameters of the foveation module. Such mean approximations are commonly employed in the attention literature, and have shown efficacy in different contexts, e.g., multi-instance learning for ultra-high resolution images ilse2018attention and the “deterministic soft” attention for image captioning in xu2015show .
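The mean approximation amounts to a probability-weighted average of the candidate patches, as in this sketch (function name and shapes are our assumptions):

```python
import numpy as np

def mean_patch(patches, probs):
    """Probability-weighted average of the D candidate patches at one
    location; fully differentiable with respect to the probabilities."""
    patches = np.asarray(patches, dtype=float)               # (D, H, W)
    return np.tensordot(np.asarray(probs), patches, axes=1)  # (H, W)
```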
Mode approximation: We also experiment with the option of selecting the most probable patch at each location:

$x_{ij} = x_{ij}^{(d^{*})}, \quad d^{*} = \arg\max_{d} \; p_{ij}^{(d)}$

where $p_{ij}^{(d)}$ is the probability the foveation module assigns to the $d$-th candidate patch at pixel $(i,j)$. Since the gradient is not well-defined in this case, we approximate it with a straight-through estimator bengio2013estimating , which directly copies the gradient from the preceding layer. Such an approach, while biased, has also been shown effective in learning discrete representations in VAE-type generative models of high-dimensional data van2017neural in the presence of an $\arg\max$ function.
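The straight-through trick uses the one-hot argmax in the forward pass while letting gradients flow through the soft probabilities in the backward pass. A numpy sketch of the forward behaviour is below; the `detach` pattern in the comment is how one might express the full trick in PyTorch (names are illustrative).

```python
import numpy as np

def hard_select(probs):
    """Forward pass of the mode approximation: a one-hot vector over the
    most probable patch. In an autograd framework, the straight-through
    version would be written as
        onehot.detach() + probs - probs.detach()
    which evaluates to `onehot` in the forward pass but passes gradients
    to `probs` in the backward pass."""
    onehot = np.zeros_like(probs)
    onehot[np.argmax(probs)] = 1.0
    return onehot
```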
4 Experiments and Results
In this section, we evaluate the performance of our foveation approach on three datasets against the baselines described in Sec. 4.1. We show that our foveation module can learn the spatial variation of optimal trade-offs in Sec. 4.3, leading to better performance as shown in Sec. 4.2. The different variants of our method, i.e., the Gumbel-Softmax, mean and mode approximations, are referred to as ‘Ours-GSM’, ‘Ours-Mean’ and ‘Ours-Mode’ respectively.
4.1 Experimental Setup

We employ the same training scheme unless otherwise stated. The foveation module is defined as a small CNN comprising 3 convolution layers followed by a softmax layer. The segmentation network is a deep CNN: HRNetV2-W48 sun2019high for the Cityscape and Gleason2019 datasets, and UPerNet xiao2018unified for the DeepGlobe dataset. In all settings, the patch extractor provides, at each image location, a set of five patches of varying FoV/resolution trade-offs. Full details are provided in the Supplementary Material.
We compare our method against a variety of baselines. Firstly, we consider the same segmentation networks (HRNet or UPerNet), but trained on input patches of fixed FoV/resolution trade-offs; there are five such baselines, since we consider 5 different patch configurations. We also include the results from ensembling these five baselines (“Ensemble”). To further investigate the benefits of learning the probabilities over patch configurations, we also compare against models trained with randomly sampled input configurations (“Random”) or with the average input patch with equal weights (“Average”). More details are in the Supplementary Material.
4.2 Quantitative and Qualitative Comparison
The quantitative comparison between our methods and the baselines is given in Table 2. We show the segmentation performance for all classes in the DeepGlobe and Gleason2019 datasets, while only the all-class average is shown for Cityscape due to the space limit (class-wise performance is provided in the Supplementary Material instead). Table 2 shows that our methods (especially ‘Ours-Mean’) generally improve over the baselines, illustrating the benefits of using a more desirable FoV/resolution at each location. It is also worth noting that a 6% boost in mIoU is achieved for Cityscape. Fig. 4 & 5 illustrate that these numerical improvements reflect meaningful differences in segmentation quality.
Our approach also achieves favourable results with respect to published results. Firstly, Table 2 shows that our approach provides a 2.4% boost over the comparable SoTA chen2019collaborative on the DeepGlobe dataset, with visually noticeable differences in segmentation quality (as shown in Fig. 4), providing finer details in the coastal area and less misclassification of agriculture in sub-area (b). Secondly, on Gleason2019, our model achieves better segmentation accuracy for the two most clinically important and ambiguous classes (Gleason Grade 3 and 4) than the top performers in the challenge, by 13.1% and 7.5%, and improves on the average performance of 6 human experts by 6.5% and 7.5% (details are provided in the Supplementary Material).
Among the three variants of our method, the mean approximation appears to be the most effective, achieving the top performance in most single-class cases. This is likely an optimisation effect, since the GSM and mode approximations rely on gradient estimators. We do note, however, that in some cases the GSM or mode approximations perform better than the mean approximation (e.g., “Forest” and “Barren” in DeepGlobe in Table 2). The third row in Fig. 5 gives a visual example of this point.
| Class | All | U. | A. | R. | F. | W. | B. | All | Benign | Grade 3 | Grade 4 | All |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Clinical experts Gleason2019 | - | - | - | - | - | - | - | 66.9 | 83.9 | 56.4 | 60.3 | - |
4.3 Evaluation of Foveation Module
In this section, we demonstrate that our foveation module can effectively learn the spatial distribution of the FoV/resolution trade-off. To visualise the learnt trade-off at each location, we use the probabilities predicted by the foveation module at each pixel to compute the weighted-average FoV of the given set of patches, which we refer to as the “foveation map”. In order to measure the quality of such foveation maps, we define a ‘Gold Standard’ by visualising the FoV of the most performant patch configuration amongst the baselines trained with fixed patches. It is, however, worth noting that this gold standard does not equate to the true optimal patch configuration for the given segmentation model and dataset, since only a limited set of patch sizes is evaluated. Details are provided in the Supplementary Materials.
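Concretely, the foveation map described above is a per-pixel expectation of the FoV under the predicted patch distribution; a sketch follows (function name and array shapes are our assumptions):

```python
import numpy as np

def foveation_map(prob_maps, fovs):
    """Per-pixel expected FoV: prob_maps has shape (D, h, w), one
    probability map per candidate patch configuration; fovs is the
    length-D vector of fields of view."""
    return np.tensordot(np.asarray(fovs, dtype=float),
                        np.asarray(prob_maps, dtype=float), axes=1)  # (h, w)
```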
Table 3: Mean squared error (MSE) between the mIoU-based gold-standard map and the weighted-average FoV output by the foveation module, for our three approaches over the three datasets. The mean and standard deviation of the MSE are calculated across all validation images in each dataset.

| Ours-Mean Approximation | Ours-GSM | Ours-Mode Approximation |
Fig. 6 visualises the foveation maps from the different variants of our method on examples from two datasets. In general, ‘Ours-Mean’ and ‘Ours-GSM’ predict foveation maps that are visually similar to the corresponding ‘Gold Standard’, illustrating that the foveation module has generally learned to provide higher-resolution patches where needed. This aligns well with our motivation, as explained in Fig. 1: a single fixed patch size is not optimal across different spatial locations. We also report the mean squared error (MSE) between the learnt foveation maps and the gold standard in Table 3. In general, the results are consistent with the qualitative observations shown in Fig. 6.
5 Conclusion

In this work, we propose a new approach for segmenting ultra-high resolution images. In particular, we introduce the foveation module, a learnable “dataloader” which, for a given image, adaptively provides the downstream segmentation model with input patches of the appropriate configuration (FoV/resolution trade-off) at different locations. Such a dataloader is trained end-to-end together with the segmentation model to maximise the task performance. We show that our approach improves performance consistently on three public datasets. Our method is simple to implement, requiring only the addition of the foveation module to an existing segmentation network. A key limitation of the current approach is that only a discrete set of patches is considered — future work aims to extend this to the continuous setting by adapting learnable deformation ideas such as recasens2018learning ; dalca2019learning to extremely high-resolution images.
We thank Hongxiang Lin for insightful discussions and Marnix Jansen for his clinical advice. We are also grateful for the EPSRC grants EP/R006032/1 and EP/M020533/1, the CRUK/EPSRC grant NS/A000069/1, and the NIHR UCLH Biomedical Research Centre, which supported this research.
-  Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3213–3223, 2016.
-  Emmanuel Maggiori, Yuliya Tarabalka, Guillaume Charpiat, and Pierre Alliez. Can semantic labeling methods generalize to any city? the inria aerial image labeling benchmark. In 2017 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), pages 3226–3229. IEEE, 2017.
-  Chetan L Srinidhi, Ozan Ciga, and Anne L Martel. Deep neural network models for computational histopathology: A survey. arXiv preprint arXiv:1912.12378, 2019.
-  Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2881–2890, 2017.
-  Jian Zhao, Jianshu Li, Xuecheng Nie, Fang Zhao, Yunpeng Chen, Zhecan Wang, Jiashi Feng, and Shuicheng Yan. Self-supervised neural aggregation networks for human parsing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 7–15, 2017.
-  Nikhil Seth, Shazia Akbar, Sharon Nofech-Mozes, Sherine Salama, and Anne L Martel. Automated segmentation of dcis in whole slide images. In European Congress on Digital Pathology, pages 67–74. Springer, 2019.
-  Liang-Chieh Chen, Yi Yang, Jiang Wang, Wei Xu, and Alan L Yuille. Attention to scale: Scale-aware semantic image segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3640–3649, 2016.
-  Wuyang Chen, Ziyu Jiang, Zhangyang Wang, Kexin Cui, and Xiaoning Qian. Collaborative global-local networks for memory-efficient segmentation of ultra-high resolution images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8924–8933, 2019.
-  Konstantinos Kamnitsas, Christian Ledig, Virginia FJ Newcombe, Joanna P Simpson, Andrew D Kane, David K Menon, Daniel Rueckert, and Ben Glocker. Efficient multi-scale 3d cnn with fully connected crf for accurate brain lesion segmentation. Medical image analysis, 36:61–78, 2017.
-  Yuqian Li, Junmin Wu, and Qisong Wu. Classification of breast cancer histology images using multi-size and discriminative patches based on deep learning. IEEE Access, 7:21400–21408, 2019.
-  Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. Semantic image segmentation with deep convolutional nets and fully connected crfs. arXiv preprint arXiv:1412.7062, 2014.
-  Bharath Hariharan, Pablo Arbeláez, Ross Girshick, and Jitendra Malik. Hypercolumns for object segmentation and fine-grained localization. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 447–456, 2015.
-  Guosheng Lin, Anton Milan, Chunhua Shen, and Ian Reid. Refinenet: Multi-path refinement networks for high-resolution semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1925–1934, 2017.
-  Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pages 234–241. Springer, 2015.
-  Hyeonwoo Noh, Seunghoon Hong, and Bohyung Han. Learning deconvolution network for semantic segmentation. In Proceedings of the IEEE international conference on computer vision, pages 1520–1528, 2015.
-  Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE transactions on pattern analysis and machine intelligence, 39(12):2481–2495, 2017.
-  Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2117–2125, 2017.
-  Ke Sun, Yang Zhao, Borui Jiang, Tianheng Cheng, Bin Xiao, Dong Liu, Yadong Mu, Xinggang Wang, Wenyu Liu, and Jingdong Wang. High-resolution representations for labeling pixels and regions. arXiv preprint arXiv:1904.04514, 2019.
-  Tete Xiao, Yingcheng Liu, Bolei Zhou, Yuning Jiang, and Jian Sun. Unified perceptual parsing for scene understanding. In Proceedings of the European Conference on Computer Vision (ECCV), pages 418–434, 2018.
-  Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 580–587, 2014.
-  Ross Girshick. Fast r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 1440–1448, 2015.
-  Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 2961–2969, 2017.
-  Adria Recasens, Petr Kellnhofer, Simon Stent, Wojciech Matusik, and Antonio Torralba. Learning to zoom: a saliency-based sampling layer for neural networks. In Proceedings of the European Conference on Computer Vision (ECCV), pages 51–66, 2018.
-  Rohit Girdhar, David F Fouhey, Mikel Rodriguez, and Abhinav Gupta. Learning a predictable and generative vector representation for objects. In European Conference on Computer Vision, pages 484–499. Springer, 2016.
-  Christopher B Choy, Danfei Xu, JunYoung Gwak, Kevin Chen, and Silvio Savarese. 3d-r2n2: A unified approach for single and multi-view 3d object reconstruction. In European conference on computer vision, pages 628–644. Springer, 2016.
-  Nanyang Wang, Yinda Zhang, Zhuwen Li, Yanwei Fu, Wei Liu, and Yu-Gang Jiang. Pixel2mesh: Generating 3d mesh models from single rgb images. In Proceedings of the European Conference on Computer Vision (ECCV), pages 52–67, 2018.
-  Georgia Gkioxari, Jitendra Malik, and Justin Johnson. Mesh r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, pages 9785–9795, 2019.
-  Lars Mescheder, Michael Oechsle, Michael Niemeyer, Sebastian Nowozin, and Andreas Geiger. Occupancy networks: Learning 3d reconstruction in function space. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4460–4470, 2019.
-  Maxim Tatarchenko, Alexey Dosovitskiy, and Thomas Brox. Octree generating networks: Efficient convolutional architectures for high-resolution 3d outputs. In Proceedings of the IEEE International Conference on Computer Vision, pages 2088–2096, 2017.
-  Dmitrii Marin, Zijian He, Peter Vajda, Priyam Chatterjee, Sam Tsai, Fei Yang, and Yuri Boykov. Efficient segmentation: Learning downsampling near semantic boundaries. In Proceedings of the IEEE International Conference on Computer Vision, pages 2131–2141, 2019.
-  Angelos Katharopoulos and François Fleuret. Processing megapixel images with deep attention-sampling models. arXiv preprint arXiv:1905.03711, 2019.
-  Yiqiu Shen, Nan Wu, Jason Phang, Jungkyu Park, Gene Kim, Linda Moy, Kyunghyun Cho, and Krzysztof J Geras. Globally-aware multiple instance classifier for breast cancer screening. In International Workshop on Machine Learning in Medical Imaging, pages 18–26. Springer, 2019.
-  Ilke Demir, Krzysztof Koperski, David Lindenbaum, Guan Pang, Jing Huang, Saikat Basu, Forest Hughes, Devis Tuia, and Ramesh Raskar. Deepglobe 2018: A challenge to parse the earth through satellite images. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2018.
-  Gleason 2019 challenge. https://gleason2019.grand-challenge.org/Home/. Accessed: 2020-02-30.
-  Chris J Maddison, Andriy Mnih, and Yee Whye Teh. The concrete distribution: A continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712, 2016.
-  Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with gumbel-softmax. arXiv preprint arXiv:1611.01144, 2016.
-  Yarin Gal, Jiri Hron, and Alex Kendall. Concrete dropout. In Advances in neural information processing systems, pages 3581–3590, 2017.
-  Felix JS Bragman, Ryutaro Tanno, Sebastien Ourselin, Daniel C Alexander, and Jorge Cardoso. Stochastic filter groups for multi-task cnns: Learning specialist and generalist convolution kernels. In Proceedings of the IEEE International Conference on Computer Vision, pages 1385–1394, 2019.
-  Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256, 1992.
-  George Tucker, Andriy Mnih, Chris J Maddison, John Lawson, and Jascha Sohl-Dickstein. Rebar: Low-variance, unbiased gradient estimates for discrete latent variable models. In Advances in Neural Information Processing Systems, pages 2627–2636, 2017.
-  Will Grathwohl, Dami Choi, Yuhuai Wu, Geoffrey Roeder, and David Duvenaud. Backpropagation through the void: Optimizing control variates for black-box gradient estimation. arXiv preprint arXiv:1711.00123, 2017.
-  Maximilian Ilse, Jakub M Tomczak, and Max Welling. Attention-based deep multiple instance learning. arXiv preprint arXiv:1802.04712, 2018.
-  Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In International conference on machine learning, pages 2048–2057, 2015.
-  Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.
-  Aaron van den Oord, Oriol Vinyals, et al. Neural discrete representation learning. In Advances in Neural Information Processing Systems, pages 6306–6315, 2017.
-  Adrian Dalca, Marianne Rakic, John Guttag, and Mert Sabuncu. Learning conditional deformable templates with convolutional networks. In Advances in neural information processing systems, pages 806–818, 2019.
-  Simon K Warfield, Kelly H Zou, and William M Wells. Simultaneous truth and performance level estimation (staple): an algorithm for the validation of image segmentation. IEEE transactions on medical imaging, 23(7):903–921, 2004.
-  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, pages 1026–1034, 2015.
-  Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
-  Wei Liu, Andrew Rabinovich, and Alexander C Berg. Parsenet: Looking wider to see better. arXiv preprint arXiv:1506.04579, 2015.
Appendix A Details of Datasets, Architectures, Training and Baselines
In this work, we validated our method on three segmentation datasets: the DeepGlobe aerial scene segmentation dataset , the CityScape urban scene segmentation dataset , and the Gleason2019 medical histopathology segmentation dataset .
The DeepGlobe  dataset has 803 high-resolution ( pixels) images of aerial scenes, densely annotated with 7 classes, of which 6 are used for training and evaluation following . We randomly split the dataset into training, validation and test sets of 455, 207 and 142 images, respectively.
The CityScape  dataset contains 5000 high-resolution ( pixels) urban scene images collected across 27 European cities. The finely annotated images contain 30 classes, of which 19 are used for training and evaluation following . The 5000 images are divided into 2975/500/1525 for training, validation and testing.
The Gleason2019  dataset contains 322 high-resolution ( pixels) medical histopathology images. Each image is finely annotated by a subset of 6 expert annotators, each of whom labels every pixel with one of four classes (Benign, Gleason Grade 3, 4 or 5). Pre-processing for empirical analysis: a subset of 298 training examples is used in the first part of our work. We fuse the 6 annotations into one via pixel-level probabilistic analysis with STAPLE , so that each image is paired with a single STAPLE-fused annotation as the gold standard (used as ground truth during training and evaluation). Pre-processing for foveation experiments: on top of the 298 images pre-processed for the empirical analysis, we extract the central  pixels of each original histology image to ensure that all inputs have a constant size.
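To make the annotation-fusion step concrete, the sketch below fuses several expert label maps per pixel. Note this is a deliberately simplified stand-in using majority voting, not the STAPLE algorithm the paper applies (STAPLE instead runs an EM procedure that weights each annotator by an estimated performance level); the function name and data layout are illustrative assumptions.

```python
from collections import Counter

def fuse_annotations(annotations):
    """Fuse multiple expert label maps into one by per-pixel majority vote.

    `annotations` is a list of equally sized 2-D label maps (lists of lists).
    This is a simplified stand-in for STAPLE, which instead estimates each
    annotator's reliability with an EM procedure.
    """
    h, w = len(annotations[0]), len(annotations[0][0])
    fused = [[0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            votes = Counter(a[i][j] for a in annotations)
            fused[i][j] = votes.most_common(1)[0][0]  # modal label wins
    return fused
```

Ties are broken by whichever label `Counter` encountered first; STAPLE avoids such arbitrary tie-breaking by reasoning probabilistically.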
A.2 Network Architectures and Implementation Details
Architectures: The foveation module is a small CNN comprised of 3 convolution layers, each with  kernels followed by BatchNorm and ReLU. The number of kernels in each respective layer is . A softmax layer is added at the end. All convolution layers are initialised with He initialization . The segmentation module is a deep CNN : HRNetV2-W48  for the CityScape and Gleason2019 datasets, and UPerNet  for the DeepGlobe dataset (details are provided in the original papers). The HRNetV2-W48 segmentation network is pre-trained on the ImageNet dataset as provided by the authors .
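He initialization draws each weight from a zero-mean Gaussian whose standard deviation is scaled to the layer's fan-in, which keeps activation variance stable through ReLU layers. A minimal stdlib sketch for a fully connected weight matrix (the function name and layer sizes are illustrative, not the paper's):

```python
import math
import random

def he_init(fan_in, fan_out, seed=0):
    """Return a fan_out x fan_in weight matrix sampled from N(0, 2 / fan_in),
    the He initialisation recommended for layers followed by ReLU."""
    rng = random.Random(seed)
    std = math.sqrt(2.0 / fan_in)  # variance 2/fan_in per He et al.
    return [[rng.gauss(0.0, std) for _ in range(fan_in)]
            for _ in range(fan_out)]
```

For a convolution layer, `fan_in` would be `kernel_h * kernel_w * in_channels` rather than the input width used here.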
Foveation module specific: In all settings, the Patch-Extractor extracts, at each location , a set of 5 patches at different FoVs ( for DeepGlobe,  for CityScape and  for Gleason2019); within each dataset, all patches are downsampled to the size of the smallest patch at the original resolution. When generating the low-resolution counterparts , downsampling rates of 1/24, 1/16 and 1/44 are applied for DeepGlobe, CityScape and Gleason2019 respectively, for both training and inference, unless otherwise stated.
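The Patch-Extractor's behaviour can be sketched as follows: crop square patches of several FoVs around a common centre and stride-downsample each to the smallest patch's size, so every scale yields a tensor of identical shape. This is an illustrative stdlib sketch on a 2-D grid; the FoV values, border clamping and nearest-neighbour striding are assumptions, not the paper's exact settings.

```python
def extract_multiscale_patches(image, cy, cx, fovs):
    """Extract square patches of several FoVs centred at (cy, cx), each
    downsampled (by striding) to the size of the smallest FoV.

    `image` is a 2-D grid (list of lists); `fovs` are patch side lengths,
    assumed to be multiples of the smallest. Border pixels are clamped.
    """
    h, w = len(image), len(image[0])
    target = min(fovs)                    # common output side length
    patches = []
    for fov in fovs:
        stride = fov // target            # downsampling factor for this FoV
        top, left = cy - fov // 2, cx - fov // 2
        patch = [[image[min(max(top + i * stride, 0), h - 1)]
                       [min(max(left + j * stride, 0), w - 1)]
                  for j in range(target)] for i in range(target)]
        patches.append(patch)
    return patches                        # each patch is target x target
```

The smallest-FoV patch keeps the original resolution, while larger FoVs trade resolution for spatial context, exactly the trade-off the foveation module learns to arbitrate.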
Minibatch construction: In our experiments, each minibatch consists of a set of patches extracted from each of  different mega-pixel images. More specifically, for each mega-pixel image , we select a subset of  locations from its low-resolution counterpart and extract patches at them. Therefore, for a set of  mega-pixel images, a total of  patches are sampled and passed to the segmentation network. We keep the total number of patches fixed in each dataset ( for DeepGlobe and CityScape, and  for Gleason2019) to max out the available GPU resources, while the combination of  and  is selected according to the validation performance of each model instance.
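The minibatch construction above can be sketched as a two-level sampling loop: pick locations per image, extract one patch per location, and concatenate across images. The function and its uniform location sampling are illustrative assumptions; `extract_patch` stands in for the Patch-Extractor described in the text.

```python
import random

def build_minibatch(images, n_locations, extract_patch, seed=0):
    """Build one minibatch: for each of the b mega-pixel images, sample
    n_locations pixel locations from its low-resolution grid and extract
    a patch at each, giving b * n_locations patches in total.

    `extract_patch(image, y, x)` is a user-supplied callable standing in
    for the Patch-Extractor.
    """
    rng = random.Random(seed)
    batch = []
    for image in images:
        h, w = len(image), len(image[0])
        for _ in range(n_locations):
            y, x = rng.randrange(h), rng.randrange(w)  # uniform location
            batch.append(extract_patch(image, y, x))
    return batch
```

Keeping `b * n_locations` constant per dataset is what lets the GPU memory budget be maxed out regardless of how the two factors are balanced.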
For all experiments, we employ the same training scheme unless otherwise stated. We optimise the parameters using Adam  with an initial learning rate of  and , and train for 50 epochs on the DeepGlobe and Gleason2019 datasets and 100 epochs on the CityScape dataset. Following prior work, we adopt the ‘poly’ learning rate policy with the power set to 0.9. We set the maximum iteration number to 410K for the DeepGlobe experiments, 595K for CityScape and 95K for Gleason2019. Segmentation networks are trained on 2 or 4 GPUs with syncBN. The temperature term in the Gumbel-Softmax gradient estimator is annealed by the scheduler  as recommended in , where  is the annealing rate and  is the current training iteration. We used the total iteration number for our models.
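A common form of the temperature schedule recommended in the Gumbel-Softmax literature anneals the temperature exponentially in the training iteration, with a floor to keep gradients usable. The constants below are illustrative defaults, not the paper's values:

```python
import math

def anneal_temperature(step, tau0=1.0, rate=1e-4, tau_min=0.5):
    """Exponentially anneal the Gumbel-Softmax temperature with training
    iteration `step`: tau = max(tau_min, tau0 * exp(-rate * step)).

    High tau gives smooth, low-variance but biased gradients; low tau
    approaches discrete one-hot samples. tau0, rate and tau_min are
    illustrative, not the paper's settings.
    """
    return max(tau_min, tau0 * math.exp(-rate * step))
```

Tying `rate` to the total number of training iterations ensures the schedule reaches its floor by the end of training regardless of dataset size.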
For each of the three datasets, we compare our method against a variety of baselines. First, we consider the same segmentation networks (HRNet or UPerNet) trained on input patches with fixed FoV/resolution trade-offs. Each such baseline is implemented by setting the output of the foveation module to a fixed one-hot vector, thus selecting a single scale from the given set of 5 patches. Additionally, on the DeepGlobe and Gleason2019 datasets, to show that our foveation approach does better than a random guess or a uniform average, two further baselines are included: a uniform-random one-hot baseline that randomly selects one of the 5 patches of varying FoV/resolution, and an average baseline that assigns equal probability 1/5 to each of the 5 patches. Lastly, to examine the robustness of our approach, we also report the performance of an ensemble that averages the five one-hot baseline predictions.
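The three baseline families differ only in the patch-weight vector that replaces the foveation module's output, which can be made explicit in a few lines (the helper and its argument names are illustrative assumptions):

```python
import random

def baseline_scale_weights(kind, d=5, fixed_index=0, seed=0):
    """Return the D-dimensional patch-weight vector used by each baseline:
    'fixed'   -> one-hot at a chosen scale (fixed FoV/resolution baseline),
    'random'  -> one-hot at a uniformly random scale,
    'average' -> uniform weights 1/D over all scales.
    """
    if kind == "fixed":
        idx = fixed_index
    elif kind == "random":
        idx = random.Random(seed).randrange(d)
    elif kind == "average":
        return [1.0 / d] * d
    else:
        raise ValueError(kind)
    return [1.0 if i == idx else 0.0 for i in range(d)]
```

The ensemble baseline then simply averages the predictions produced under the five 'fixed' vectors rather than averaging the weight vectors themselves.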
Appendix B Additional Results
B.1 Patch Configuration Matters in Segmentation
To further illustrate the impact of the patch FoV/resolution trade-off on segmentation performance and its spatial variation across the image, Fig. B.1.1 shows the results of the same motivational experiment as in Fig. 2(a) of the main text, but on the DeepGlobe and CityScape datasets. For each dataset, we first train a set of different segmentation networks, each with a different combination of FoV and downsampling rate (i.e., resolution); see the small blue dots on the curve in Fig. B.1.1. We note that the maximum tensor size of the input patch is capped at  for DeepGlobe and  for CityScape, and is constant along the curve in Fig. B.1.1. The segmentation network with the best performance for each class is highlighted with differently shaped markers (see Fig. B.1.1). This again shows that there is no “one-size-fits-all” patch configuration of the training data that achieves the best performance both overall and for every individual class.
B.2 Additional Quantitative Comparison on CityScape and Gleason2019
The class-wise segmentation performance on CityScape is given in Table B.2.1 and Table B.2.2. These results show that our approach improves on the fixed patch size baselines by a large margin, both in overall average mIoU (+6%) and in most single-class IoUs.
|Class||All||road||sidewalk||building||wall||fence||pole||traffic light||traffic sign||vegetation|
As described in Section 4.2, on the Gleason2019 dataset we compare against published results as follows: we compare our approach with the top 2 entries on the Gleason2019 challenge leaderboard (https://gleason2019.grand-challenge.org/Results/), ranked by average segmentation accuracy over all classes. We also collect the highest segmentation accuracy achieved for each class into a third, Single-Class-Best, case for comparison. We quantify segmentation performance via per-class pixel accuracy, to be consistent and comparable with the results released on the leaderboard. It is worth noting that, for all results, we exclude Gleason Grade 5 from the evaluation because it is under-represented (only 2% of pixels in the dataset). The results are shown in Table B.2.3: our model achieves better segmentation accuracy than the top challenge performers on the two most clinically important and ambiguous classes (Gleason Grade 3 and 4), by 13.1% and 15.7% respectively.
|Experiment||Benign||Grade 3||Grade 4|
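The per-class pixel accuracy used above can be computed directly from flattened label maps, with under-represented classes excluded as the text describes. A minimal sketch (the function signature is an illustrative assumption):

```python
def per_class_pixel_accuracy(pred, target, num_classes, ignore=()):
    """Per-class pixel accuracy: for each class c, the fraction of pixels
    labelled c in `target` that are also predicted as c. Classes in
    `ignore` (e.g. an under-represented grade) or absent from `target`
    are left as None. `pred` and `target` are flat lists of int labels.
    """
    acc = [None] * num_classes
    for c in range(num_classes):
        if c in ignore:
            continue
        total = sum(1 for t in target if t == c)
        if total == 0:
            continue  # class absent from the ground truth
        correct = sum(1 for p, t in zip(pred, target) if t == c and p == c)
        acc[c] = correct / total
    return acc
```

Unlike IoU, this metric does not penalise false positives of class c, which is why it is only used here for comparability with the leaderboard.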
B.3 Additional Qualitative Comparison
B.4 Evaluation of Foveation Module
Here we give details on computing the foveation map and the performance-weighted 'gold standard'. For an input image , at each pixel in , we use the foveation output, a D-dimensional probability vector , to compute a weighted average of the given D FoVs (corresponding to the given set of D patches), referred to as the weighted-average FoV. To visualise the spatial distribution of the FoV/resolution trade-off, we compute and plot the weighted-average FoV over all locations in  and refer to it as the "foveation map". For consistency and better visualisation, we apply min-max normalisation to [0, 1] based on the smallest and largest FoV in the given set of D patches, where 1 indicates the largest FoV (lowest resolution) and 0 the smallest FoV (highest resolution).
We use a similar strategy to visualise FoVs weighted by performance (mIoU). Specifically, at each pixel in , we use the segmentation performance (mIoU) of the five fixed-patch baselines to compute a weighted average of the FoVs of the given set of patches, plotted in the same way as the foveation map; we refer to this as the 'gold standard' against which the foveation map is compared. It is worth noting that this 'gold standard' is not true ground truth, since only a limited set of patch sizes is evaluated.
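Both visualisations reduce to the same computation: a per-pixel weighted average of the D FoVs followed by min-max normalisation. A minimal sketch (the data layout is an illustrative assumption):

```python
def foveation_map(probabilities, fovs):
    """Compute the foveation map: at each pixel, the weighted-average FoV
    under a D-dimensional weight vector, min-max normalised to [0, 1] by
    the smallest and largest FoV (1 = largest FoV / lowest resolution,
    0 = smallest FoV / highest resolution).

    `probabilities` is an H x W grid of length-D weight vectors. Passing
    per-pixel baseline mIoUs (normalised to sum to 1) instead of the
    foveation module's probabilities yields the 'gold standard' map.
    """
    lo, hi = min(fovs), max(fovs)
    return [[(sum(p * f for p, f in zip(probs, fovs)) - lo) / (hi - lo)
             for probs in row] for row in probabilities]
```

A uniform weight vector maps to the midpoint only when the FoVs are evenly spaced; in general it lands at the (normalised) mean FoV.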
To show that the foveation module learns effectively, we plot the foveation map at different training epochs. Fig. B.4.4 shows the same example from the DeepGlobe dataset as Fig. 4 in the main text, and demonstrates that 1) the foveation module progressively refines the learnt spatial distribution of the FoV/resolution trade-off; and 2) the third row of Fig. B.4.4 explains why our foveation approach outperforms GLNet at sub-image (b) in Fig. 4: the foveation module learns to invest small, high-resolution patches over coastal areas to capture fine-scale context, thereby avoiding misclassifying the sea as forest (which shares a similar green colour in the input image), as GLNet does.
B.5 Sensitivity to Hyper-parameters
Here we first evaluate the sensitivity of our approach to the resolution of  at inference time on the CityScape dataset, since it forms the most significant computational bottleneck of our approach. We apply the same downsampling rate of 1/16 to obtain  at training time, while at inference time we evaluate all models at different downsampling rates for , with validation mIoU reported in Table B.5.1. In general, Table B.5.1 shows that 1) validation mIoU increases with the resolution of ; and 2) our approach gains a larger mIoU boost with increasing resolution of . To better visualise the results, Fig. B.5.1 plots Table B.5.1, showing that 1) the foveated approaches gain more mIoU than the fixed patch size baselines as the resolution of  increases; and 2) the gain starts to converge at a downsampling rate of 1/128, at which point inference time can be reduced to 1/64 of that at the highest resolution (downsampling rate of 1/16), dramatically improving inference efficiency.
|Experiments/ downsample rate||1/512||1/256||1/128||1/64||1/16|
Appendix C Pseudo-codes
Here we provide pseudo-code for our method in the case where each mini-batch is constructed from a single mega-pixel image (i.e., , as described in Sec. A.2). We also intend to clean up the whole codebase and release it with the final version.
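Since the pseudo-code itself is not reproduced in this extract, the following is a hedged, runnable sketch of the forward data flow it would describe: sample a scale per location from the foveation module via Gumbel-Softmax, extract the chosen patch, and segment it. `foveation_logits`, `extract_patch` and `segment` are placeholders for the modules described in the text, and the hard argmax stands in for the straight-through estimator; none of this is the authors' exact implementation.

```python
import math
import random

def gumbel_softmax_sample(logits, tau, rng):
    """Draw a relaxed one-hot sample from `logits` at temperature `tau`."""
    g = [-math.log(-math.log(rng.random())) for _ in logits]  # Gumbel noise
    z = [(l + gi) / tau for l, gi in zip(logits, g)]
    m = max(z)
    e = [math.exp(v - m) for v in z]   # stable softmax
    s = sum(e)
    return [v / s for v in e]

def foveated_forward(image, locations, foveation_logits, extract_patch,
                     segment, tau=1.0, seed=0):
    """One forward pass of the foveated pipeline for a single mega-pixel
    image: at each sampled location, draw a scale from the foveation
    module's logits via Gumbel-Softmax, extract the chosen patch and run
    the segmentation network on it.
    """
    rng = random.Random(seed)
    outputs = []
    for y, x in locations:
        probs = gumbel_softmax_sample(foveation_logits(y, x), tau, rng)
        scale = probs.index(max(probs))  # hard choice (straight-through)
        outputs.append(segment(extract_patch(image, y, x, scale)))
    return outputs
```

In training, the relaxed `probs` (rather than the hard index) would carry gradients back into the foveation module, which is the point of the Gumbel-Softmax relaxation.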