Dense Steerable Filter CNNs for Exploiting Rotational Symmetry in Histology Images

by   Simon Graham, et al.

Histology images are inherently symmetric under rotation, where each orientation is equally likely to appear. However, this rotational symmetry is not widely utilised as prior knowledge in modern Convolutional Neural Networks (CNNs), resulting in data-hungry models that learn independent features at each orientation. Allowing CNNs to be rotation-equivariant removes the necessity to learn this set of transformations from the data and instead frees up model capacity, allowing more discriminative features to be learned. This reduction in the number of required parameters also reduces the risk of overfitting. In this paper, we propose Dense Steerable Filter CNNs (DSF-CNNs) that use group convolutions with multiple rotated copies of each filter in a densely connected framework. Each filter is defined as a linear combination of steerable basis filters, enabling exact rotation and decreasing the number of trainable parameters compared to standard filters. We also provide the first in-depth comparison of different rotation-equivariant CNNs for histology image analysis and demonstrate the advantage of encoding rotational symmetry into modern architectures. We show that DSF-CNNs achieve state-of-the-art performance, with significantly fewer parameters, when applied to three different tasks in the area of computational pathology: breast tumour classification, colon gland segmentation and multi-tissue nuclear segmentation.




I Introduction

The recent advances in the analysis of Haematoxylin & Eosin (H&E) stained whole-slide images (WSIs) can largely be attributed to the rise of digital slide scanning [1]. In particular, Convolutional Neural Networks (CNNs) leverage the prior knowledge that images have translational symmetry and utilise a weight sharing strategy, which guarantees that a translation of the input will result in a proportional translation of the features. This property, known as translation equivariance, is an inherent property of the CNN and removes the need to learn features at all spatial locations, significantly reducing the number of learnable parameters. In certain image analysis applications, where there is no global orientation, it is desirable to extend this property of equivariance beyond translation to rotation as well. One such example is the field of computational pathology (CPath), where important image features can appear at any orientation (Fig. 1). Therefore, we should be able to learn those features regardless of their orientation. In the absence of rotation-equivariance, data augmentation is typically used, where multiple rotated copies of the WSI patches are introduced to the network during training. However, the augmentation strategy requires many more parameters in order to learn weights of different orientations. Instead, encoding rotational symmetry as prior knowledge into current deep learning architectures requires fewer parameters and leads to an overall superior discriminative ability.

Fig. 1: Cropped circular regions from a whole-slide image. Each orientation is equally likely to appear.

CPath is ripe ground for the utilisation of rotation-equivariant models, yet most models fail to incorporate this prior knowledge into the CNN architectures. Inspired by recent developments in the study of rotation-equivariant CNNs [2, 3, 4, 5], we propose Dense Steerable Filter based CNNs (DSF-CNNs) that integrate steerable filters [6] with group convolution [2] and a densely connected framework [7] for superior performance. Each filter is defined as a linear combination of circular harmonic basis filters, enabling exact rotation and significantly reducing the number of parameters compared to standard filters. The main contributions of this work are listed as follows:

  • A Dense Steerable Filter CNN that achieves rotation-equivariance by integrating steerable filter group convolutions within a densely connected network.

  • The first thorough comparison of multiple rotation-equivariant approaches for CPath.

  • State-of-the-art performance across multiple histology image datasets with far fewer parameters.

II Related Work

II-A CNNs for translation equivariance

Pioneered by LeCun et al. in 1994 [8], CNNs inherently incorporate prior knowledge of translation symmetry in images and achieve translation equivariance by re-using filters at all spatial locations. Therefore, a shift of the input leads to a proportional shift of the filter responses. This design drastically reduces the number of required parameters because features do not need to be learned independently at each location. Since the increase in computing power and the development of algorithms that assist network optimisation [9], CNNs have become deeper [10, 7], leading to current state-of-the-art performance in numerous image recognition tasks [11, 12]. As a result of the success of deep learning, CNNs have since been widely used in CPath for various tasks including: gland segmentation [13, 14]; nucleus segmentation [15, 16, 17]; mitosis detection [18]; cancer type prediction [19] and cancer grading [20, 21]. Yet, unlike translation, CNNs do not behave well with respect to input rotation because this symmetry is not built into the network architecture.
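The contrast between translation and rotation described above can be checked numerically. The following is a minimal NumPy sketch, not taken from the paper: `circular_correlate` is a hypothetical toy filter with periodic boundaries, and the random image and kernel are placeholders. It shows that filtering commutes with shifting, but in general not with rotating, the input.

```python
import numpy as np

def circular_correlate(image, kernel):
    """Cross-correlate with periodic boundaries (toy implementation, not optimised)."""
    out = np.zeros_like(image, dtype=float)
    kh, kw = kernel.shape
    for dy in range(kh):
        for dx in range(kw):
            # A negative roll brings pixel (y + dy, x + dx) to position (y, x).
            out += kernel[dy, dx] * np.roll(image, shift=(-dy, -dx), axis=(0, 1))
    return out

rng = np.random.default_rng(0)
image = rng.normal(size=(16, 16))
kernel = rng.normal(size=(3, 3))

# Translation: shifting then filtering equals filtering then shifting.
shift = (3, 5)
shifted_then_filtered = circular_correlate(np.roll(image, shift, axis=(0, 1)), kernel)
filtered_then_shifted = np.roll(circular_correlate(image, kernel), shift, axis=(0, 1))
translation_equivariant = np.allclose(shifted_then_filtered, filtered_then_shifted)

# Rotation: the same check fails for a generic (non-symmetric) kernel.
rotated_then_filtered = circular_correlate(np.rot90(image), kernel)
filtered_then_rotated = np.rot90(circular_correlate(image, kernel))
rotation_equivariant = np.allclose(rotated_then_filtered, filtered_then_rotated)
```

With a generic kernel the first flag is true and the second is false, which is the gap that rotation-equivariant architectures aim to close.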

II-B Exploiting rotational symmetry

Rotating the data: It is well known that histology images have no global orientation and therefore standard practice is to apply rotation augmentation to the training data [22]. This improves performance, but requires many parameters and is therefore prone to overfitting. Also, there is no guarantee that CNNs trained with rotation augmentation will learn an equivariant representation and generalise to data with small rotations [23]. To reduce the variance of predictions across multiple orientations, test-time augmentation (TTA) can be used [24]. However, with TTA, inference time scales linearly with the number of augmented copies. TI-Pooling [25] utilises multiple rotated copies of the input in a twin network architecture, where a pooling operation over orientations is performed to find the optimal canonical instance of the input images for training. However, like TTA, TI-Pooling is computationally expensive.
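As an illustration of why TTA cost grows with the number of copies, here is a minimal sketch restricted to the four 90° rotations. It is an assumption-laden toy, not the procedure of [24]: `predict_with_rot90_tta` and `CountingModel` are hypothetical names, and the "model" is a simple difference filter standing in for a trained network with a dense (per-pixel) output.

```python
import numpy as np

def predict_with_rot90_tta(model, image):
    """Average a dense prediction over the four 90-degree rotations of `image`."""
    aligned = []
    for k in range(4):
        prediction = model(np.rot90(image, k))    # one forward pass per rotated copy
        aligned.append(np.rot90(prediction, -k))  # rotate the prediction back
    return np.mean(aligned, axis=0)

class CountingModel:
    """Hypothetical stand-in for a trained network: a horizontal difference filter."""
    def __init__(self):
        self.calls = 0

    def __call__(self, x):
        self.calls += 1
        return x - np.roll(x, 1, axis=1)

rng = np.random.default_rng(0)
image = rng.normal(size=(8, 8))
model = CountingModel()

tta_output = predict_with_rot90_tta(model, image)
forward_passes = model.calls  # one pass per augmented copy

# The averaged prediction is equivariant to 90-degree rotations,
# even though the underlying model is not.
tta_of_rotated = predict_with_rot90_tta(model, np.rot90(image))
tta_is_equivariant = np.allclose(tta_of_rotated, np.rot90(tta_output))
plain_is_equivariant = np.allclose(model(np.rot90(image)), np.rot90(model(image)))
```

The equivariance here is bought with four forward passes per image, and it only holds for the rotations that were actually averaged over.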

Rotating the filters: Cohen & Welling [2] pioneered group equivariant CNNs (G-CNNs), where the convolution was generalised to share weights over additional symmetry groups beyond translation. However, they limited the filter transformation to 90° rotations and horizontal/vertical flips to ensure exact transformations on the 2D pixel grid. Veeling et al. [26] showed that these G-CNNs can be used to improve the performance of metastasis detection in breast histology images. Furthermore, Linmans et al. [27] and Graham et al. [28] extended the application of the G-CNNs proposed by Cohen & Welling to pixel-based segmentation in histology images, highlighting an improved performance over conventional CNNs. The symmetries of a square grid are limited to integer translations extended by the dihedral group of order 8 (4 reflections and 4 rotations). To counter the limitation of working with square grids in the G-CNN, Hoogeboom et al. [29] used hexagonal filters. However, this strategy requires images to be resampled on a hexagonal lattice, which is an additional overhead. Instead of using exact filter rotations, Bekkers et al. [30] and Lafarge et al. [5] applied G-CNNs to several medical imaging tasks by rotating filters with bilinear interpolation. Therefore, this method was not restricted to rotations by multiples of 90°, but may introduce interpolation artefacts. Oriented response networks [31] use active rotating filters during the convolution that explicitly encode location and orientation information within the feature maps.

The aforementioned methods carry forward the feature maps for each orientation throughout the network. Instead, Marcos et al. [4] converted the output of multiple convolutions with rotated filter copies to a vector field by considering the magnitude and angle of the highest scoring orientation at every spatial location, leading to more compact models. To help overcome the issue of inexact filter rotation, the method only considered parameters at the centre of each filter and therefore required larger filters and consequently more parameters.

Rotating the feature maps: Dieleman et al. proposed a method similar to the G-CNN, but instead of rotating the filters, the feature maps were rotated. This design choice has no effect on the equivariance, yet any rotation that is not a multiple of 90° may suffer from interpolation artefacts.

Steerable filters: CNNs that encode rotation-equivariance are typically only equivariant to discrete rotations. To achieve full 360° equivariance, Worrall et al. [32] used the concept of steerable filters [6] and constrained the weights to be complex circular harmonics. Despite features being equivariant to continuous rotations, constraining the weights in this way may hinder the expressive capacity of the CNN. Weiler et al. [3] learned steerable filters as a linear combination of atomic basis filters, which enabled exact filter rotation within G-CNNs. Then, these steerable filters were used within the group convolution to enable the network to be equivariant to rotation. Weiler and Cesa [33] then performed an extensive comparison of rotation-equivariant models using steerable filters. Our method builds on the approach proposed by Weiler et al. [3], by incorporating steerable filter group convolutions into a densely connected framework for superior performance.

III Mathematical Framework

In this section we present the key mathematical concepts used in our framework. We first describe images and feature maps as functions, together with the group actions on them. Then, we introduce steerable filters and describe the group-convolution (G-convolution) operation with these filters. We then describe how using this operation leads to G-equivariance. Below, we deal with a single filter at a time, although the method needs a whole filter bank to be used. We mainly follow the method described by Weiler et al. [3], but we use a different formulation. We encourage readers to read both approaches for a thorough understanding.

III-A Images and feature maps as functions

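We model an image as a map $f:\mathbb{C}\to\mathbb{R}$ with bounded support [footnote 1: The support of $f$ is the smallest closed subset of $\mathbb{C}$ containing $\{z : f(z)\neq 0\}$]. We take the domain of $f$ to be $\mathbb{C}$, rather than the equivalent $\mathbb{R}^2$, because rotations are more simply discussed on $\mathbb{C}$.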

We denote by $E(2)$ the group of euclidean isometries of the plane. Each element of $E(2)$ can be written in the form $z\mapsto e^{i\theta}z+c$ or $z\mapsto e^{i\theta}\bar{z}+c$, where $\theta\in\mathbb{R}$ and $c\in\mathbb{C}$. For simplicity and easier comparison with other works on rotations in CNNs, we decided not to consider flips in this paper. We denote by $SE(2)$ the group of transformations of the form $z\mapsto e^{i\theta}z+c$.

Let be the vector space over of all , with bounded support. If and , we define by:


In keeping with Equation 1, we define the action of on by:


We will sometimes use alternative notations: and .

Our filters will sometimes be complex-valued, in which case, our feature maps may also be complex valued. Let denote the vector space of all functions with bounded support.

In practice, $f$ will not be a function, but a two-dimensional array with real entries (or a three-channel array, for an RGB image). To discuss this, we denote by $\mathbb{Z}^2$ the set of all pairs of integers. One might think that it would be more appropriate to model an image by a map $\mathbb{Z}^2\to\mathbb{R}$. The problem with this is that the group of isometries of $\mathbb{Z}^2$ is too small. For example, the only rotations fixing $\mathbb{Z}^2$ are rotations through multiples of $\pi/2$. Since rotations by more general angles are clearly essential in the study of histology or astronomy images, we are obliged to use $E(2)$, or $SE(2)$ if we decide not to study flips.

Suppose that is rotation through an angle of fixing , where is a positive integer, not equal to 1, 2 or 4. Then one can prove that the smallest closed subgroup of containing both and all integer translations is the semidirect product , where is the cyclic group of order , consisting of angles . When (the trivial representation), then no rotation is used, and we use {} instead of . This indicates that, if we want to deal conveniently with rotations of images, then we will have to deal mathematically with the semidirect product.

III-B Steerable functions and filters

The group acts linearly on (over ) and on (over ). The action by is given by

Explicitly, this means:
1) If where , then ;
2) for all ;
3) , for all , and . There is a similar statement for real scalars.

We define to be the complex vector subspace spanned by the orbit . may be infinite dimensional over —an example is given by , where and the radial profile satisfies simple conditions that are discussed immediately after Equation 3.

If is a finite dimensional vector space, we say that is steerable. This applies both to and to .

Theorem: A necessary and sufficient condition for to be steerable is that there should exist an integer , and radial profile functions for and , such that, in polar coordinates:


where some or all of the radial profile functions may be identically zero. In order to be sure that this expression is meaningful, we insist that there should be a , such that, if , then, for all , . This ensures that has bounded support. The equality , where and defines and , except when . In that case is defined, but is not. For this reason, we insist that , unless , when the angle plays no role.

If satisfies Equation 3, then is clearly finite dimensional. The reverse implication takes a bit longer to argue, but follows easily from standard theorems in Group Representation Theory. [Footnote 2: For full mathematical rigour, the theorem requires the additional hypothesis that, for each , is a continuous function of . See also [34] for more technical details.]

Fig. 2 is a graphical representation of basis harmonic filters that appear in Equation 3.

Fig. 2: Example circular harmonic basis filters sampled on the 11×11 square grid. Red and blue borders denote the real and imaginary parts respectively. Each pair of images comes from a single term in Equation 3. In this figure, the particular radial profile functions are all Gaussians, as they are in our proposed model. These Gaussians have mean/mode/max equal to . The integer specifies the frequency.
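A basis of the kind shown in Fig. 2 can be sketched in a few lines of NumPy. This is an illustrative construction under stated assumptions, not the authors' code: the function name `circular_harmonic_basis`, the ring radii, the Gaussian width `sigma` and the maximum frequency are all hypothetical choices. Each basis filter is a Gaussian radial profile multiplied by a complex angular factor `exp(i * k * phi)`.

```python
import numpy as np

def circular_harmonic_basis(size=11, ring_radii=(0.0, 2.0, 4.0), max_frequency=2, sigma=0.6):
    """Complex circular harmonics sampled on a size x size grid.

    Returns a dict keyed by (ring index j, angular frequency k).
    All parameter values are illustrative assumptions.
    """
    half = (size - 1) / 2
    coords = np.arange(size) - half
    x, y = np.meshgrid(coords, -coords)  # y axis points up
    r = np.hypot(x, y)
    phi = np.arctan2(y, x)

    basis = {}
    for j, mu in enumerate(ring_radii):
        radial = np.exp(-((r - mu) ** 2) / (2 * sigma ** 2))  # Gaussian ring of radius mu
        for k in range(max_frequency + 1):
            if mu == 0.0 and k > 0:
                continue  # the angle is undefined at the origin, so only k = 0 is kept there
            psi = radial * np.exp(1j * k * phi)
            if k > 0:
                psi[r == 0] = 0  # enforce a zero radial profile at the origin for k != 0
            basis[(j, k)] = psi
    return basis

basis = circular_harmonic_basis()
real_part, imag_part = basis[(1, 1)].real, basis[(1, 1)].imag  # one "pair of images"
```

The real and imaginary parts of each entry correspond to the paired panels described in the caption; the `k = 0` filters are purely real.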

Real Version: In practice we will work with steerable real-valued filters. Since a real-valued steerable filter is also a complex-valued steerable filter, we can apply Equation 3 to obtain, in the same notation:

Now . It follows that we can write instead (but the radial profiles change):


where and, for , .
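The practical benefit of writing a real filter as the real part of a sum of circular harmonics is that rotation becomes a change of coefficients rather than a resampling of pixels. The sketch below illustrates this under assumptions of our own: the helper names `harmonic` and `steered_filter`, the Gaussian width and the example coefficients are hypothetical. Rotating by an angle `theta` multiplies the coefficient of frequency `k` by `exp(-i * k * theta)`.

```python
import numpy as np

def harmonic(size, mu, k, sigma=0.6):
    """Gaussian ring of radius mu times exp(i * k * phi), sampled on a square grid."""
    half = (size - 1) / 2
    coords = np.arange(size) - half
    x, y = np.meshgrid(coords, -coords)  # y axis points up, so angles are counter-clockwise
    r, phi = np.hypot(x, y), np.arctan2(y, x)
    psi = np.exp(-((r - mu) ** 2) / (2 * sigma ** 2)) * np.exp(1j * k * phi)
    if k != 0:
        psi[r == 0] = 0  # the angle is undefined at the origin
    return psi

def steered_filter(coeffs, size, theta):
    """Real filter Re(sum_k w_k * psi_k), rotated counter-clockwise by theta."""
    total = np.zeros((size, size), dtype=complex)
    for (mu, k), w in coeffs.items():
        # Rotation by theta only changes the phase of each coefficient.
        total += w * np.exp(-1j * k * theta) * harmonic(size, mu, k)
    return total.real

# Hypothetical complex weights, keyed by (ring radius, angular frequency).
coeffs = {(0.0, 0): 1.0, (2.0, 1): 0.5 - 0.3j, (3.0, 2): -0.8 + 0.2j, (4.0, 3): 0.4j}
size = 11

base = steered_filter(coeffs, size, 0.0)
quarter_turn = steered_filter(coeffs, size, np.pi / 2)
matches_rot90 = np.allclose(quarter_turn, np.rot90(base))  # agrees with exact grid rotation

eighth_turn = steered_filter(coeffs, size, np.pi / 4)  # 45 degrees, no interpolation needed
```

For a 90° turn the steered filter agrees with the exact grid rotation `np.rot90`, and the same formula yields a 45° copy directly, which is where pixel-space rotation would need interpolation.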

III-C G-convolutions and classical convolutions

Convolutions appear in multiple guises throughout Mathematics, Applications of Mathematics and Computer Science, and in particular in the study of CNNs (Convolutional Neural Networks). We start by giving the definition of $G$-convolution, where $G$ is a group with a measure $\mu$; this means that, given $f:G\to\mathbb{R}$, we can form the integral, denoted by $\int_G f\,d\mu$ or $\int_{g\in G} f(g)\,d\mu(g)$. We will stick to the unimodular case, which is general enough for all cases of interest in this paper. The word unimodular means that we can change the dummy variable $g$ in the integral to $hg$, or $gh$ or $g^{-1}$ ($h$ constant), without changing the value of the integral. In more familiar terms, to check that the additive group $\mathbb{R}$ is unimodular, note that, when computing $\int_{-\infty}^{\infty} f(x)\,dx$, we can change the variable $x$ to $x+h$ or to $h+x$ or to $-x$ without changing the value of the integral.

Given maps and , we define their G-convolution by


The first equality is a definition; the second follows by change of variable.

G-convolution is automatically G-equivariant. To see this, note that, for any ,


It follows that

Equivariance can help in making the CNN “black box” less opaque.

Classical convolutions are special cases of G-convolutions. For example, with , the measure is standard euclidean area, and the group law is standard addition. In the case of matrices, the group and the measure is standard counting measure, so that the integral becomes a sum. We denote classical convolution with ’’, and use ’’ to denote G-convolution.
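To make the counting-measure case concrete, the sketch below implements a G-convolution on the finite cyclic group of order n, where the integral becomes a sum. The convention used, $(f_1 * f_2)(g) = \sum_h f_1(h)\, f_2(h^{-1}g)$, is one standard choice and is an assumption here rather than a transcription of Equation 5; the function names are hypothetical.

```python
import numpy as np

def group_conv_cyclic(f1, f2):
    """(f1 * f2)(g) = sum_h f1(h) f2(h^{-1} g) on the cyclic group Z_n (counting measure)."""
    n = len(f1)
    out = np.zeros(n)
    for g in range(n):
        for h in range(n):
            out[g] += f1[h] * f2[(g - h) % n]  # h^{-1} g is g - h in additive notation
    return out

def group_conv_cyclic_alt(f1, f2):
    """Same quantity after the change of variable h -> g h^{-1}."""
    n = len(f1)
    return np.array([sum(f1[(g - h) % n] * f2[h] for h in range(n)) for g in range(n)])

def act(u, f):
    """Left action (u.f)(g) = f(u^{-1} g), a cyclic shift by u."""
    return np.roll(f, u)

rng = np.random.default_rng(0)
f1, f2 = rng.normal(size=12), rng.normal(size=12)
u = 5

two_forms_agree = np.allclose(group_conv_cyclic(f1, f2), group_conv_cyclic_alt(f1, f2))
# Equivariance: acting on the input and then convolving equals convolving and then acting.
is_equivariant = np.allclose(group_conv_cyclic(act(u, f1), f2),
                             act(u, group_conv_cyclic(f1, f2)))
```

The cyclic group is commutative, so this toy does not exercise the ordering subtleties of a non-commutative group, but the change of variable and the equivariance argument follow the same pattern.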

III-D G-convolution and G-equivariance

We now show, following the pioneering work of Cohen and Welling [2] and of Weiler et al. [3], how to deal with the difficulties explained at the end of Subsection III-A. We need to move from convolutions to G-convolutions.

The group that we work with will include all translations of and sufficiently many rotations so that we manage to span the whole circle of rotations with sufficient resolution. We also need to avoid asking for more resolution than our data can provide, as this will result in wasted time and effort and may introduce artifacts and overfitting.

Let be the group of rotations by angles of the form , where . Let .

Then, as a space is the disjoint union of copies of , where

The measure on is given by using the usual euclidean (area) measure on each . Note that this particular group is unimodular, as explained at the beginning of Subsection III-C, because rotation is measure preserving on the plane.

Except for the input layer, all convolutions in our network are G-convolutions. The convolution of two functions is another function with domain (see Equation 5). So the first task for the network is to convert to a map . We do this by an operation (which we denote by ’’) similar to convolving with a steerable filter , where is given by a formula as in Equation 4. We define, for ,


as in Equation (8) of [3]. In our CNN, is actually a matrix, not a function. The meaning of Equation 7 is that the steerable filter needs to be sampled on the integer grid rotated through an angle in a clockwise direction and that the resulting matrix is then convolved with the matrix .

The Supplementary Material for [3] proves that this operation is rotation-equivariant, namely that, for any ,


Classical convolution is also inherently translation equivariant. Combining this translation equivariance with the rotation equivariance from Equation 8, we obtain G-equivariance for ’’.

Equation (9) of [3] gives details of how our Equation 5 can be interpreted in terms of classical convolution in our special case of . Once again, this involves sampling on rotated versions of the standard square grid. We also need to permute the layers cyclically.
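The behaviour described in this subsection (feature maps that rotate with the input while the orientation channels are permuted cyclically) can be verified in the simplest setting of four 90° rotations. The sketch below is a simplification and not the paper's implementation: it uses cross-correlation with an arbitrary kernel rotated by `np.rot90` instead of steerable filters, and the names `correlate_valid` and `lift_c4` are hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def correlate_valid(image, kernel):
    """'Valid' 2D cross-correlation using only NumPy."""
    windows = sliding_window_view(image, kernel.shape)
    return np.einsum("ijkl,kl->ij", windows, kernel)

def lift_c4(image, kernel):
    """Responses to the four 90-degree rotations of `kernel`; axis 0 indexes orientation."""
    return np.stack([correlate_valid(image, np.rot90(kernel, k)) for k in range(4)])

rng = np.random.default_rng(0)
image = rng.normal(size=(12, 12))
kernel = rng.normal(size=(5, 5))

lifted = lift_c4(image, kernel)              # shape (4, 8, 8)
lifted_rot = lift_c4(np.rot90(image), kernel)

# Rotating the input rotates every feature map and shifts the orientation axis by one.
expected = np.roll(np.rot90(lifted, axes=(1, 2)), 1, axis=0)
rotation_equivariant = np.allclose(lifted_rot, expected)
```

With steerable filters the same structure would extend to rotations that are not multiples of 90°, because the rotated filter copies can be generated without interpolation.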

IV Dense Steerable Filter CNN

Fig. 3: Overview of the two types of convolution used in our approach. The input convolution learns a steerable filter, which is then rotated to give filter orientations. Each oriented filter is convolved with the original input to give feature maps. The feature map with the red border is the result of the convolution between the input and the filter with the same corresponding red border. The hidden layer convolution learns steerable filters. These filters are a function on the group and therefore undergo a channel permutation with rotation. The convolution of the steerable filters highlighted in red with the input gives the feature map as output also highlighted in red.

IV-A Network architecture

The main building blocks of our proposed rotation-equivariant DSF-CNN [footnote 3: Model code: ] are: an input steerable filter G-convolution layer; steerable filter G-dense-blocks; and a G-pooling layer. Below, we build on the theoretical explanation in Section III to describe the separate components of our proposed approach.

Input G-convolution: Up to the G-pooling operation, all convolutions within our network are steerable G-convolutions, as described in Section III-C. Therefore, we pre-define a set of circular harmonic basis filters using Equation 3 and sample the filters on the square grid, as can be seen in Fig. 2. Then, we learn how to linearly combine these atomic basis filters to generate steerable filters and consider only the real part for our convolution filter, as shown in Equation 4. This can be visualised in Fig. 3a. For the input steerable G-convolution, we create rotated copies of each steerable filter and independently convolve the filters with the input, which is a function on the plane . As a result, the input steerable G-convolution produces feature maps that are a function on the group , with the group representation . This operation can be seen in Fig. 3b, where the convolution between the input and the steerable filter bordered in red produces the feature map also bordered in red. Now, when the input is rotated and the input G-convolution is performed, the feature maps also rotate, but in addition undergo a channel permutation.

G-dense-blocks: To enable efficient gradient propagation, encourage feature re-use and improve overall performance, we use dense connectivity [7] between G-convolutions in hidden layers of the network. For our hidden layer steerable G-convolutions, the input is now a function on and therefore we must similarly ensure that our steerable filters are a function on . Therefore, when rotating these filters, they should undergo an additional channel permutation. This can be seen in Fig. 3c. Here, we see that steerable filters are generated as shown by the red circle. These filters are convolved with the input to generate the feature map with the red border. Then, as filters rotate, they also shift channel position and the convolution operation is repeated to produce the next feature map. For each G-dense-block, the feature maps of all preceding layers are concatenated to the input before performing the G-convolution. This increases the number of connections between layers, strengthening feature propagation. Specifically, each G-dense-block consists of a number of units. Each unit contains a 7×7 G-convolution followed by a 5×5 G-convolution that produce 14 and 6 orientation-dependent feature maps respectively. After these units, the G-dense-block concludes by applying a final 5×5 G-convolution.

G-pooling: At the output of the network, we convert our orientation-dependent feature maps from functions on the group to functions on the plane. We do this by selecting the maximum value at each spatial location over the orientations. This operation ensures that the output of G-pooling is invariant to rotation of the input.
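A minimal sketch of this pooling step, assuming the feature maps are stored as an array with the orientation index on the first axis (a layout chosen here for illustration; `g_pool` is a hypothetical name). It models the effect of a 90° input rotation on four-orientation feature maps as a spatial rotation plus a cyclic channel shift, and checks that the maximum over orientations removes the channel shift.

```python
import numpy as np

def g_pool(features):
    """Max over the orientation axis (axis 0) of an (N, H, W) feature stack."""
    return features.max(axis=0)

rng = np.random.default_rng(0)
n_orientations = 4
features = rng.normal(size=(n_orientations, 6, 6))

# Assumed effect of rotating the input by 90 degrees: each map rotates spatially
# and the orientation channels shift cyclically by one position.
features_after_rotation = np.roll(np.rot90(features, axes=(1, 2)), 1, axis=0)

pooled = g_pool(features)
pooled_after_rotation = g_pool(features_after_rotation)

# The channel permutation disappears; only the spatial rotation remains.
channel_order_removed = np.array_equal(pooled_after_rotation, np.rot90(pooled))
# Any spatially global statistic of the pooled map is then unchanged by the rotation.
global_max_invariant = pooled.max() == pooled_after_rotation.max()
```

After pooling, the result no longer depends on which orientation channel held the strongest response, so later layers can be ordinary convolutions.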

Classification: For our classification DSF-CNN, we initially perform the input steerable G-convolution followed by a hidden layer G-convolution. We then use 4 G-dense-blocks, consisting of 3, 4, 5 and 6 dense units respectively. After every G-convolution layer we use a group equivariant batch normalisation, which aggregates moments per group rather than per spatial feature map, and a ReLU non-linearity. Before every G-dense-block, we perform spatial max-pooling to decrease the dimensions of the feature maps. After the final G-dense-block, we perform G-pooling and then apply three 1×1 classical convolution operations to get the final output.

Segmentation: We extend our DSF-CNN to the task of segmentation by up-sampling feature maps after the final G-dense-block in the aforementioned classification CNN. Specifically, we up-sample by a factor of 2 with bilinear interpolation and then utilise a G-dense-block. This is repeated until the spatial dimensions of the original image are regained. Starting from the deepest layer of the up-sampling branch, the dense blocks contain 4, 3 and 2 units respectively. In line with U-Net [35], we also use skip connections to propagate information from the encoder to the decoder. After the feature maps have been up-sampled, we use a single hidden layer G-convolution, which is followed by G-pooling such that the feature maps are once again a function on the plane. Finally, we use two 1×1 classical convolutions to obtain the output, where we segment both the object and the contour to help separate touching instances. For nuclear segmentation, we additionally predict the eroded nuclei masks, which are used as markers in marker-controlled watershed.

V Experiments and Results

V-A Experimental overview

Recently, there has been a growing number of proposed CNNs that achieve rotation-equivariance [2, 3, 4, 30], yet there is a lack of comprehensive evaluation of the various methods for the analysis of histopathology images. We perform a thorough comparison of various rotation-equivariant CNNs and demonstrate the effectiveness of the proposed model. Specifically, we compare a baseline CNN with VF-CNNs [4], G-CNNs with standard filters [2, 30] and G-CNNs with steerable filters [3] and assess the impact of increasing the number of filter rotations in each model. For a thorough analysis, each method is applied to the tasks of breast tumour classification, nuclear segmentation and gland segmentation. After gaining an insight into the performance of the different rotation-equivariant models, we then compare our proposed Dense Steerable Filter CNN with the state-of-the-art methods on each of the three datasets used in our experiments.

Fig. 4: Image regions from the three datasets. For nuclear segmentation, gland segmentation and tumour classification, we use the Kumar [17], CRAG [13] and PCam [26] datasets. Yellow boundaries show the pathologist annotation, while green and red borders denote non-tumour and tumour image patches.

V-B The three datasets

We use the following three publicly available histology image datasets:
Breast tumour classification: PCam [26] is a dataset of 327K image patches of size 96×96 pixels at 10× magnification, extracted from the Camelyon16 dataset [36], which contains 400 H&E stained breast WSIs. Each image patch was labelled as tumour if the central region (32×32) contained at least one tumour pixel as given by the original annotation [36].
Multi-tissue nucleus segmentation: The Kumar [17] dataset contains 30 image tiles of size 1,000×1,000 from seven organs (6 breast, 6 liver, 6 kidney, 6 prostate, 2 bladder, 2 colon and 2 stomach) of The Cancer Genome Atlas (TCGA) database, acquired at 40× magnification. Within each image, the boundary of each nucleus is fully annotated.
Colorectal gland segmentation: The CRAG dataset [13] consists of 213 H&E images of colorectal adenocarcinoma (CRA) patients, mostly of size 1,512×1,516 pixels, taken from 38 WSIs acquired at 20× magnification. It is split into 173 training images and 40 test images with different cancer grades and pixel-based gland annotation.

V-C Evaluation metrics

Here we describe the metrics used for evaluation. For tumour classification, we calculated the area under the receiver operating characteristic curve (AUC) to assess the binary classification performance. For gland segmentation, we employed the same quantitative measures that were used in the GlaS challenge [37]. These metrics consist of the F1 score, DICE and Hausdorff distance at the object level and assess the quality of instance segmentation. For nuclear segmentation, we report the binary DICE and panoptic quality (PQ). Here, the binary DICE assesses the ability of the method to distinguish nuclei from the background, whereas PQ provides insight into the quality of instance segmentation.
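For readers unfamiliar with the two nuclear segmentation metrics, a small sketch follows. It assumes the commonly used definitions: DICE as twice the overlap divided by the total foreground size, and PQ as the summed IoU of matched pairs divided by TP + FP/2 + FN/2, with a match requiring IoU above 0.5. The paper's exact evaluation code may differ, and the function names and toy label maps are hypothetical.

```python
import numpy as np

def binary_dice(pred_labels, true_labels):
    """DICE between the foreground (non-zero) regions of two label maps."""
    pred, true = pred_labels > 0, true_labels > 0
    denom = pred.sum() + true.sum()
    return 1.0 if denom == 0 else 2.0 * np.logical_and(pred, true).sum() / denom

def panoptic_quality(pred_labels, true_labels, iou_threshold=0.5):
    """PQ = (sum of IoU over matched pairs) / (TP + 0.5 * FP + 0.5 * FN)."""
    pred_ids = [int(i) for i in np.unique(pred_labels) if i != 0]
    true_ids = [int(i) for i in np.unique(true_labels) if i != 0]
    matched_pred, iou_sum = set(), 0.0
    for t in true_ids:
        t_mask = true_labels == t
        for p in np.unique(pred_labels[t_mask]):  # only overlapping predictions can match
            if p == 0:
                continue
            p_mask = pred_labels == p
            iou = np.logical_and(t_mask, p_mask).sum() / np.logical_or(t_mask, p_mask).sum()
            if iou > iou_threshold:  # IoU > 0.5 makes the matching unique
                matched_pred.add(int(p))
                iou_sum += iou
                break
    tp = len(matched_pred)
    fp = len(pred_ids) - tp
    fn = len(true_ids) - tp
    denom = tp + 0.5 * fp + 0.5 * fn
    return 0.0 if denom == 0 else iou_sum / denom

# Toy example: one nucleus found exactly, one missed, one spurious detection.
true_labels = np.zeros((6, 6), dtype=int)
true_labels[0:2, 0:2] = 1
true_labels[3:5, 3:5] = 2
pred_labels = np.zeros((6, 6), dtype=int)
pred_labels[0:2, 0:2] = 1
pred_labels[0:2, 4:6] = 2

dice = binary_dice(pred_labels, true_labels)      # 2 * 4 / (8 + 8) = 0.5
pq = panoptic_quality(pred_labels, true_labels)   # 1.0 / (1 + 0.5 + 0.5) = 0.5
```

In this toy case both metrics equal 0.5, but they diverge in general: DICE ignores which pixels belong to which instance, whereas PQ penalises merged or split nuclei.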

V-D Comparative analysis of rotation-equivariant models

Baseline models: For the task of breast tumour classification, we implement a baseline CNN for comparison with the aforementioned rotation-equivariant models. The model consists of a series of convolution, batch normalisation, non-linear and spatial pooling operations, which are then followed by three 1×1 convolutions to obtain the final output, denoting the probability of an input patch being tumour.

For the tasks of gland and nuclear segmentation we leverage the power of the fully convolutional neural network architecture, which allows us to use the same model architecture, irrespective of the input size. The encoder of the baseline segmentation model uses the same architecture as the baseline classification CNN. Then a series of up-sampling and convolution operations are used to regain the spatial dimensions of the original image. In line with U-Net, we use skip connections to incorporate features from the encoder, but utilise summation as opposed to concatenation. In line with our proposed model described in Section IV-A, at the output of the network we perform segmentation of the object and the contour and additionally predict the eroded masks for nuclear segmentation.

Rotation-equivariant models: To assess the performance of various rotation-equivariant approaches, we modify the baseline models, but keep the fundamental architecture the same. The main difference between different models is how the filters are rotated, how many filter orientations are considered and how the convolution operation is performed.

For each rotation-equivariant model we consider 4, 8 and 12 filter orientations. When applying rotation to a filter with an angle that is a multiple of 90°, the rotation is exact because the output can still be represented on the square grid. However, any other rotation may give interpolation artefacts and therefore may have negative implications for rotation-equivariance. Therefore, in line with Marcos et al. [4] and Lafarge et al. [5], for both the VF-CNN and standard G-CNN, we apply circular masking to the filters when using the groups with 8 and 12 orientations. However, this masking still leads to inevitable interpolation artefacts in the centre of the filter. Steerable filters as defined by Equation 3 do not suffer from interpolation artefacts and, therefore, circular masking is not needed.

In all comparative experiments for rotation-equivariance, we fix each filter to be of size 7×7. We used a larger filter than typically used in modern CNNs because this size ensures that we can construct a good basis set for steerable filter generation, with reasonable frequency content and reduced aliasing.

For fair comparison, we ensure that the number of parameters is similar between different models. For both standard and steerable G-CNNs, the number of parameters increases with the size of the group. This is because one feature map is produced per orientation of the filter, which increases the number of required filters in the subsequent layer. To maintain the same number of parameters as the baseline CNN, we divide the number of filters in each layer by $\sqrt{N}$, where $N$ is the number of orientations in the group. Instead of carrying forward all orientations throughout the network, VF-CNNs collapse the orientation-dependent feature maps to two feature maps, representing magnitude and angle. Therefore, the VF-CNN requires more filters in the next layer, but the number of parameters stays constant irrespective of the size of the group. To ensure the same number of parameters as the baseline CNN, for all group sizes we divide the number of filters in each layer of VF-CNNs by $\sqrt{2}$.
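The parameter-matching argument can be checked with simple arithmetic. The sketch below assumes that a hidden group-convolution filter spans all N orientation channels of every input feature, so that dividing the width of each layer by the square root of N restores the baseline count. The helper names and the layer width of 64 are illustrative assumptions, and for steerable filters the per-filter count would be the number of basis coefficients rather than the number of pixels.

```python
import math

def conv_params(kernel_size, c_in, c_out):
    """Weights in a standard convolution layer (biases ignored)."""
    return kernel_size ** 2 * c_in * c_out

def gconv_params(kernel_size, c_in, c_out, n_orientations):
    """Weights in a hidden group-convolution layer: each of the c_out filters spans
    all n_orientations channels of each of the c_in input features; the rotated
    copies share these weights."""
    return kernel_size ** 2 * (c_in * n_orientations) * c_out

kernel_size, width = 7, 64  # hypothetical baseline layer
baseline = conv_params(kernel_size, width, width)

matched = {}
for n in (4, 8, 12):
    reduced = round(width / math.sqrt(n))  # divide the number of filters by sqrt(N)
    matched[n] = (reduced, gconv_params(kernel_size, reduced, reduced, n))

relative_gap = {n: abs(p - baseline) / baseline for n, (_, p) in matched.items()}
```

For N = 4 the match is exact, since 64 / 2 = 32 is an integer; for N = 8 and N = 12 rounding the width leaves a gap of a few percent, consistent with the text's "similar" rather than identical parameter counts.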

In all models, we down-sample with max-pooling, but for VF-CNNs we use a modified pooling strategy, based on the magnitude of the feature maps. Similarly, when using VF-CNNs, we do not incorporate the angle information when using batch normalisation (BN) and non-linear activation functions; otherwise the angles may change important information about relative and global orientations. For G-CNNs, we use a modified BN that aggregates moments per group rather than per spatial feature map.
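One way to read the modified BN for G-CNNs is that the mean and variance are shared across all orientation channels of a feature. The NumPy sketch below illustrates this reading only. The tensor layout (batch, height, width, features, orientations), the function name `group_batch_norm` and the omission of learned scale and shift are assumptions, not details from the paper.

```python
import numpy as np

def group_batch_norm(x, eps=1e-5):
    """Normalise with moments shared across orientations (illustrative).

    x has the assumed layout (batch, height, width, features, orientations).
    Moments are aggregated over every axis except the feature axis, so all
    orientation channels of a feature receive the same mean and variance.
    """
    axes = (0, 1, 2, 4)
    mean = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 8, 8, 3, 4)) * 5.0 + 2.0
y = group_batch_norm(x)

# A cyclic shift of the orientation axis (how a rotation acts on it)
# commutes with the normalisation, because the statistics are shared.
shifted_then_normed = group_batch_norm(np.roll(x, 1, axis=4))
normed_then_shifted = np.roll(y, 1, axis=4)
```

Normalising each orientation channel separately would apply a different scale and offset per orientation, which would break this commutation property.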

To verify our implementations of the various rotation-equivariant networks, we cross-checked the performance of each model against reported benchmarks on the rotated MNIST dataset [38] before applying them to the histology datasets.

V-E Quantitative results

Tumour classification: We report comparative results of different rotation-equivariant models on the PCam dataset at the top of Table I. We observe that the VF-CNN does not perform as well as the baseline CNN for the task of tumour classification. This may be because the VF-CNN undergoes a significant change in numerous standard operations to enable us to work with vector fields. Despite this, we see that its performance improves when we increase the number of filter rotations. When we utilise the group convolution, with filter rotation as performed by Bekkers et al. [30] and Lafarge et al. [5], we see an improved performance when using up to 8 filter orientations. This gain in performance can be attributed to incorporating our prior knowledge of rotational symmetry into the network. To ensure that we maintain a similar number of parameters, we need to reduce the number of feature maps at each layer when the size of the group is increased. This may explain the drop in performance when using 12 filter orientations. When using steerable filters, but with no filter rotation, we observe an improved performance over conventional CNNs, highlighting the benefit of learning a linear combination of basis filters, rather than standard filters. Then, as we increase the size of the group to 4 and 8 orientations, we see an improvement in the performance. We also observe that using steerable filters rather than standard filters within the G-convolution gives a better result.

At the bottom of Table I we compare the performance of our proposed DSF-CNN with the P4M-DenseNet [26], which is the top-performing method that was proposed with the introduction of the PCam dataset. This approach integrates G-convolutions, as proposed by Cohen & Welling [2], into a densely connected CNN [7]. Here, the network uses filter rotations by multiples of 90° and also uses reflections. This is denoted by D4, which is the dihedral group containing 4 rotation and 4 reflection symmetries. In addition, we compare results to the commonly used ResNet-34 [10], ResNet-50 [10], DenseNet-121 [7] and DenseNet-169 [7]. Despite its small number of parameters, we observe that our method achieves the best performance with an AUC of 0.975, which is a promising improvement over the previous state-of-the-art.

Gland segmentation: We compare the performance of the different rotation-equivariant models for gland segmentation on the CRAG dataset in the top part of Table II. Here, we again see that the VF-CNNs are inferior to regular CNNs, but observe an increase in the performance when incorporating more filter rotations within the network. Similar to our observations for breast tumour classification, we see that increasing the group size within the group convolution leads to an increase in performance, but the best performance is achieved when using 8 filter orientations. For this task, using steerable filters in the group convolution led to the best performance.

| Method | Group | Parameters | AUC |
|---|---|---|---|
| CNN | {e} | 564 | 0.949 |
| G-CNN [2] | C4 | 561 | 0.964 |
| G-CNN [30, 5] | C8 | 557 | 0.968 |
| G-CNN [30, 5] | C12 | 557 | 0.962 |
| VF-CNN [4] | C4 | 556 | 0.871 |
| VF-CNN [4] | C8 | 556 | 0.881 |
| VF-CNN [4] | C12 | 556 | 0.898 |
| Steerable G-CNN [3] | {e} | 553 | 0.963 |
| Steerable G-CNN [3] | C4 | 546 | 0.969 |
| Steerable G-CNN [3] | C8 | 565 | 0.971 |
| Steerable G-CNN [3] | C12 | 545 | 0.969 |
| ResNet-34 [10] | | 21.3 | 0.942 |
| ResNet-50 [10] | | 23.5 | 0.948 |
| DenseNet-121 [7] | | 7.8 | 0.921 |
| DenseNet-169 [7] | | 13.3 | 0.920 |
| P4M-DenseNet [26] | D4 | 119 | 0.963 |
| DSF-CNN (Ours) | | 2.2 | 0.975 |

TABLE I: Tumour classification results on the PCam dataset [26]. Top: comparison of different rotation-equivariant models with a similar parameter budget. Bottom: comparison of proposed approach with state-of-the-art.

| Method | Group | Params | Obj F1 | Obj Dice | Obj Haus |
|---|---|---|---|---|---|
| CNN | {e} | 984 | 0.793 | 0.809 | 246.0 |
| G-CNN [2] | C4 | 982 | 0.833 | 0.856 | 170.4 |
| G-CNN [30, 5] | C8 | 988 | 0.837 | 0.866 | 157.4 |
| G-CNN [30, 5] | C12 | 979 | 0.818 | 0.834 | 192.2 |
| VF-CNN [4] | C4 | 975 | 0.711 | 0.721 | 318.9 |
| VF-CNN [4] | C8 | 975 | 0.745 | 0.758 | 287.5 |
| VF-CNN [4] | C12 | 975 | 0.776 | 0.782 | 251.9 |
| Steerable G-CNN [3] | {e} | 981 | 0.811 | 0.848 | 175.9 |
| Steerable G-CNN [3] | C4 | 984 | 0.837 | 0.869 | 164.8 |
| Steerable G-CNN [3] | C8 | 989 | 0.861 | 0.888 | 139.5 |
| Steerable G-CNN [3] | C12 | 976 | 0.855 | 0.870 | 156.2 |
| FCN8 [35] | {e} | 134.3 | 0.796 | 0.835 | 199.5 |
| U-Net [35] | {e} | 37.0 | 0.827 | 0.844 | 196.9 |
| MILD-Net [13] | {e} | 83.3 | 0.869 | 0.883 | 146.2 |
| Rota-Net [28] | {e} | 71.3 | 0.869 | 0.887 | 144.2 |
| DSF-CNN (Ours) | | 3.7 | 0.874 | 0.891 | 138.4 |

TABLE II: Gland segmentation results on the CRAG [13] dataset. Top: comparison of different rotation-equivariant models with a similar parameter budget. Bottom: comparison of proposed approach with state-of-the-art.

| Method | Group | Params | B-Dice | PQ |
|---|---|---|---|---|
| CNN | {e} | 984 | 0.767 | 0.447 |
| G-CNN [2] | C4 | 982 | 0.793 | 0.490 |
| G-CNN [30, 5] | C8 | 988 | 0.811 | 0.519 |
| G-CNN [30, 5] | C12 | 979 | 0.814 | 0.534 |
| VF-CNN [4] | C4 | 975 | 0.800 | 0.499 |
| VF-CNN [4] | C8 | 975 | 0.808 | 0.507 |
| VF-CNN [4] | C12 | 975 | 0.813 | 0.514 |
| Steerable G-CNN [3] | {e} | 981 | 0.791 | 0.510 |
| Steerable G-CNN [3] | C4 | 984 | 0.809 | 0.542 |
| Steerable G-CNN [3] | C8 | 989 | 0.818 | 0.543 |
| Steerable G-CNN [3] | C12 | 976 | 0.820 | 0.558 |
| FCN8 [39] | {e} | 134.3 | 0.797 | 0.312 |
| SegNet [badrinarayanan2017segnet] | {e} | 29.4 | 0.811 | 0.407 |
| U-Net [35] | {e} | 37.0 | 0.758 | 0.478 |
| Mask-RCNN [40] | {e} | 40.1 | 0.760 | 0.509 |
| DIST [16] | {e} | 9.2 | 0.789 | 0.443 |
| Micro-Net [41] | {e} | 192.6 | 0.797 | 0.519 |
| CIA-Net [42] | {e} | 22.0 | 0.818 | 0.577 |
| HoVer-Net [15] | {e} | 54.7 | 0.826 | 0.597 |
| DSF-CNN (Ours) | | 3.7 | 0.826 | 0.600 |

TABLE III: Nuclear segmentation results on the Kumar [17] dataset. Top: comparison of different rotation-equivariant models with a similar parameter budget. Bottom: comparison of proposed approach with state-of-the-art.

In the bottom part of Table II, we compare our proposed approach with MILD-Net [13] and Rota-Net [28], which are top-performing gland segmentation methods and therefore can be appropriately used for performance benchmarking. Like the P4M-DenseNet, Rota-Net makes use of the standard G-convolution, but is limited to 90° filter rotations. In addition, we compare with FCN8 and U-Net as they are two widely used CNNs for segmentation. We observe that our DSF-CNN achieves the best performance with a fraction of the parameter budget. Notably, our model has around 20 times fewer parameters than Rota-Net and MILD-Net.

Nuclear segmentation: We report the comparative results of different rotation-equivariant methods for nuclear segmentation on the Kumar dataset in the top part of Table III. Here, we see that all rotation-equivariant approaches show a significant improvement over conventional CNNs and we see an improvement when increasing the number of filter orientations to 12 in all models. It is interesting to see that, for this task, VF-CNNs perform well. Therefore, we speculate that this model may be better suited to smaller datasets. This would also explain why we see an increase in performance for up to 12 filter orientations when using group convolutions on this dataset: reducing the number of feature maps per layer for large group sizes will have less of an impact on models trained on small datasets, because fewer parameters are needed to fit the training set. Once again, we observe that the steerable group convolutional networks for segmentation of nuclei are superior to standard group convolutional networks that use bilinear interpolation during filter rotation.

We evaluate the performance of our proposed method against several state-of-the-art approaches in the bottom part of Table III. In particular, HoVer-Net [15], CIA-Net [42], Micro-Net [41] and DIST [16] have been purpose-built for the task of nuclear segmentation and, therefore, provide a competitive benchmark. The proposed DSF-CNN once again achieves the best performance: it matches the state-of-the-art HoVer-Net on binary Dice (0.826) and slightly exceeds it on panoptic quality (0.600 against 0.597), while requiring a fraction of the parameter count.

Fig. 5: Variance between the predictions and features for multiple orientations of the input. The original image is rotated in steps of π/4 to give 8 orientations and each copy is passed through the network to enable variance calculation. Features A and B are located at the beginning and end of the network respectively. The rotation-equivariant CNN we compare with is the steerable G-CNN.

V-F Visual results

In Fig. 5 we visualise the features and the corresponding outputs as we rotate the input with angle increments of π/4 (8 orientations in total) for both the baseline CNN and the steerable G-CNN. Specifically, we analyse the properties of both CNNs trained for the tasks of gland and nuclear segmentation. To observe the feature map transformation with rotation of the input, we analyse two sets of feature maps: Feature Map A at the output of the 2nd G-convolution and Feature Map B at the output of the final G-convolution. Similarly, we observe how the output probability map transforms when the input is rotated.

To analyse this, we feed each image orientation into the network to obtain a set of orientation-dependent features and output probability maps. Then, after rotating features and probability maps back to their original orientation, we compute the pixel-wise variance map of the features and the output to see how they change with rotation of the input. For the rotation-equivariant model, we observe that there is a near-negligible variance between the features of each input orientation. On the other hand, there is much higher variance between orientation-dependent features of standard CNNs. This implies that the rotation-equivariant CNN successfully learns an equivariant feature representation. Also, there is a smaller variance between the predictions of multiple input orientations for the rotation-equivariant CNN as compared to the standard CNN. Thus, the rotation-equivariant CNN behaves as expected with rotation of the input, which is a particularly desirable property when training CNNs with histology image data.
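The variance analysis described above can be sketched in a few lines of NumPy. To avoid interpolation, this illustration uses only the four exact 90° rotations rather than the eight orientations used in Fig. 5. The callable `model_fn` and the two toy "models" are hypothetical stand-ins for a trained network, not the authors' code.

```python
import numpy as np

def rotation_variance_map(model_fn, image, n_rot=4):
    """Pixel-wise variance of model outputs over 90-degree input rotations."""
    aligned = []
    for k in range(n_rot):
        rotated = np.rot90(image, k)        # rotate the input by k * 90 degrees
        out = model_fn(rotated)             # forward pass (hypothetical callable)
        aligned.append(np.rot90(out, -k))   # rotate the output back
    return np.var(np.stack(aligned, axis=0), axis=0)

def isotropic_model(x):
    # Toy equivariant map: 4-neighbour average with wrap-around borders.
    return (np.roll(x, 1, 0) + np.roll(x, -1, 0)
            + np.roll(x, 1, 1) + np.roll(x, -1, 1)) / 4.0

def directional_model(x):
    # Toy non-equivariant map: horizontal finite difference.
    return x - np.roll(x, 1, axis=1)

rng = np.random.default_rng(0)
image = rng.random((16, 16))
var_equivariant = rotation_variance_map(isotropic_model, image)
var_standard = rotation_variance_map(directional_model, image)
print(var_equivariant.max(), var_standard.max())
```

The isotropic map commutes with 90° rotations, so its variance map is zero up to floating-point error. The directional map responds differently at each orientation and shows clearly non-zero variance, mirroring the contrast between the two models in Fig. 5.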

We display visual results achieved by our proposed DSF-CNN for both nuclear and gland segmentation in Figs. 6 and 7. We observe that our model achieves a good-quality segmentation that closely resembles the ground truth.

V-G Implementation and training details

We implemented our framework with the open source software library TensorFlow version 1.12.0 [43] on a workstation equipped with two NVIDIA GeForce 1080 Ti GPUs. During training, data augmentation including flip, rotation, Gaussian blur and median blur was applied. For breast tumour classification, we fed the original patches of size 96×96 into the network. For gland and nuclear segmentation, we used patches of size 448×448 and 256×256 respectively. For tumour classification, we trained our model using a batch size of 32 and then used a batch size of 8 for both gland and nucleus segmentation. We used cross-entropy loss for all tumour classification and gland segmentation models, whereas we used a combination of weighted cross-entropy and dice loss for nuclear segmentation. For all models, we trained using Adam optimisation with an initial learning rate of 10, which was reduced as training progressed. The network was trained with an RGB input, normalised between 0 and 1.
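For reference, a combined weighted cross-entropy and dice loss for binary nuclear segmentation can be sketched as follows. This is a NumPy illustration under stated assumptions, not the training code used here: the per-pixel weight map, the equal weighting of the two terms (`dice_weight=1.0`) and the smoothing constant `eps` are all hypothetical choices.

```python
import numpy as np

def weighted_cross_entropy(probs, target, pixel_weights, eps=1e-7):
    """Binary cross-entropy averaged with a per-pixel weight map."""
    p = np.clip(probs, eps, 1.0 - eps)
    ce = -(target * np.log(p) + (1.0 - target) * np.log(1.0 - p))
    return float(np.sum(pixel_weights * ce) / np.sum(pixel_weights))

def dice_loss(probs, target, eps=1e-7):
    """One minus the soft Dice overlap between prediction and target."""
    intersection = np.sum(probs * target)
    denom = np.sum(probs) + np.sum(target)
    return float(1.0 - (2.0 * intersection + eps) / (denom + eps))

def nuclear_seg_loss(probs, target, pixel_weights, dice_weight=1.0):
    """Sum of weighted cross-entropy and dice loss (assumed equal weighting)."""
    return (weighted_cross_entropy(probs, target, pixel_weights)
            + dice_weight * dice_loss(probs, target))

target = np.zeros((8, 8))
target[2:6, 2:6] = 1.0                 # a single square "nucleus"
weights = np.ones_like(target)         # uniform weights for this example
loss_perfect = nuclear_seg_loss(target.copy(), target, weights)
loss_inverted = nuclear_seg_loss(1.0 - target, target, weights)
```

The cross-entropy term penalises per-pixel errors, with the weight map available to emphasise chosen regions. The dice term measures region overlap and is less sensitive to the imbalance between nuclear and background pixels.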

Fig. 6: Visual results for nuclear segmentation on the Kumar dataset [17] using our proposed DSF-CNN.
Fig. 7: Visual results for gland segmentation on the CRAG dataset [13] using our proposed DSF-CNN.

VI Discussion and Conclusions

Conventional CNNs do not behave as expected with rotation of the input, which is a particularly undesirable property in the field of computational pathology, where important features in histology images can appear at any orientation. Instead, rotation-equivariant CNNs build this prior knowledge of rotational symmetry into the network, such that features rotate in accordance with the input without explicitly learning features at various orientations. In this paper, we proposed a densely connected steerable filter CNN that achieves state-of-the-art performance on the three datasets used in our experiments with a fraction of the parameter budget of recent top-performing models. We conducted a thorough comparative analysis of various rotation-equivariant CNNs applied to the tasks of breast tumour classification, gland segmentation and nuclear segmentation. We showed that steerable filter group convolutions gave the best quantitative results on all three tasks, where 8 filter orientations consistently gave a strong performance. We visualised features within a rotation-equivariant model to demonstrate that they rotate with the input and therefore have a higher degree of feature map interpretability. Finally, we showed that rotation-equivariant models give more stable predictions with input rotation than regular CNNs do. In future work, we will consider incorporating additional symmetries into the group convolution, such as mirror and scale symmetries. This will further increase the interpretability of feature maps and may lead to an improvement in performance. Also, the exploration of further symmetries in histology images may help direct future research in computational pathology.

This work was supported in part by the UK Medical Research Council (No. MR/P015476/1). NR is part of the PathLAKE digital pathology consortium, which is funded from the Data to Early Diagnosis and Precision Medicine strand of the government’s Industrial Strategy Challenge Fund, managed and delivered by UK Research and Innovation (UKRI).


  • [1] D. R. Snead, Y.-W. Tsang, A. Meskiri, P. K. Kimani, R. Crossman, N. M. Rajpoot, E. Blessing, K. Chen, K. Gopalakrishnan, P. Matthews et al., “Validation of digital pathology imaging for primary histopathological diagnosis,” Histopathology, vol. 68, no. 7, pp. 1063–1072, 2016.
  • [2] T. Cohen and M. Welling, “Group equivariant convolutional networks,” in International conference on machine learning, 2016, pp. 2990–2999.
  • [3] M. Weiler, F. A. Hamprecht, and M. Storath, “Learning steerable filters for rotation equivariant cnns,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
  • [4] D. Marcos, M. Volpi, N. Komodakis, and D. Tuia, “Rotation equivariant vector field networks,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 5048–5057.
  • [5] M. W. Lafarge, E. J. Bekkers, J. P. Pluim, R. Duits, and M. Veta, “Roto-translation equivariant convolutional networks: Application to histopathology image analysis,” arXiv preprint arXiv:2002.08725, 2020.
  • [6] W. T. Freeman and E. H. Adelson, “The design and use of steerable filters,” IEEE Transactions on Pattern Analysis & Machine Intelligence, no. 9, pp. 891–906, 1991.
  • [7] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger, “Densely Connected Convolutional Networks,” ArXiv e-prints, p. arXiv:1608.06993, Aug. 2016.
  • [8] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
  • [9] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” arXiv preprint arXiv:1502.03167, 2015.
  • [10] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
  • [11] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A Large-Scale Hierarchical Image Database,” in CVPR09, 2009.
  • [12] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in European conference on computer vision.   Springer, 2014, pp. 740–755.
  • [13] S. Graham, H. Chen, J. Gamper, Q. Dou, P.-A. Heng, D. Snead, Y. W. Tsang, and N. Rajpoot, “Mild-net: Minimal information loss dilated network for gland instance segmentation in colon histology images,” Medical image analysis, vol. 52, pp. 199–211, 2019.
  • [14] H. Chen, X. Qi, L. Yu, and P.-A. Heng, “Dcan: deep contour-aware networks for accurate gland segmentation,” in Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, 2016, pp. 2487–2496.
  • [15] S. Graham, Q. D. Vu, S. E. A. Raza, A. Azam, Y. W. Tsang, J. T. Kwak, and N. Rajpoot, “Hover-net: Simultaneous segmentation and classification of nuclei in multi-tissue histology images,” Medical Image Analysis, vol. 58, p. 101563, 2019.
  • [16] P. Naylor, M. Laé, F. Reyal, and T. Walter, “Segmentation of nuclei in histopathology images by deep regression of the distance map,” IEEE transactions on medical imaging, vol. 38, no. 2, pp. 448–459, 2018.
  • [17] N. Kumar, R. Verma, S. Sharma, S. Bhargava, A. Vahadane, and A. Sethi, “A dataset and a technique for generalized nuclear segmentation for computational pathology,” IEEE transactions on medical imaging, vol. 36, no. 7, pp. 1550–1560, 2017.
  • [18] S. U. Akram, T. Qaiser, S. Graham, J. Kannala, J. Heikkilä, and N. Rajpoot, “Leveraging unlabeled whole-slide-images for mitosis detection,” in Computational Pathology and Ophthalmic Medical Image Analysis.   Springer, 2018, pp. 69–77.
  • [19] S. Graham, M. Shaban, T. Qaiser, N. A. Koohbanani, S. A. Khurram, and N. Rajpoot, “Classification of lung cancer histology images using patch-level summary statistics,” in Medical Imaging 2018: Digital Pathology, vol. 10581.   International Society for Optics and Photonics, 2018, p. 1058119.
  • [20] E. Arvaniti, K. S. Fricker, M. Moret, N. Rupp, T. Hermanns, C. Fankhauser, N. Wey, P. J. Wild, J. H. Rueschoff, and M. Claassen, “Automated gleason grading of prostate cancer tissue microarrays via deep learning,” Scientific reports, vol. 8, no. 1, pp. 1–11, 2018.
  • [21] M. Shaban, R. Awan, M. M. Fraz, A. Azam, Y. Tsang, D. Snead, and N. M. Rajpoot, “Context-aware convolutional neural network for grading of colorectal cancer histology images,” IEEE Transactions on Medical Imaging, pp. 1–1, 2020.
  • [22] D. Tellez, G. Litjens, P. Bandi, W. Bulten, J.-M. Bokhorst, F. Ciompi, and J. van der Laak, “Quantifying the effects of data augmentation and stain color normalization in convolutional neural networks for computational pathology,” arXiv preprint arXiv:1902.06543, 2019.
  • [23] A. Azulay and Y. Weiss, “Why do deep convolutional networks generalize so poorly to small image transformations?” arXiv preprint arXiv:1805.12177, 2018.
  • [24] N. Moshkov, B. Mathe, A. Kertesz-Farkas, R. Hollandi, and P. Horvath, “Test-time augmentation for deep learning-based cell segmentation on microscopy images,” bioRxiv, p. 814962, 2019.
  • [25] D. Laptev, N. Savinov, J. M. Buhmann, and M. Pollefeys, “Ti-pooling: transformation-invariant pooling for feature learning in convolutional neural networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 289–297.
  • [26] B. S. Veeling, J. Linmans, J. Winkens, T. Cohen, and M. Welling, “Rotation equivariant cnns for digital pathology,” in International Conference on Medical image computing and computer-assisted intervention.   Springer, 2018, pp. 210–218.
  • [27] J. Linmans, J. Winkens, B. S. Veeling, T. S. Cohen, and M. Welling, “Sample efficient semantic segmentation using rotation equivariant convolutional networks,” arXiv preprint arXiv:1807.00583, 2018.
  • [28] S. Graham, D. Epstein, and N. Rajpoot, “Rota-net: Rotation equivariant network for simultaneous gland and lumen segmentation in colon histology images,” in European Congress on Digital Pathology.   Springer, 2019, pp. 109–116.
  • [29] E. Hoogeboom, J. W. Peters, T. S. Cohen, and M. Welling, “Hexaconv,” arXiv preprint arXiv:1803.02108, 2018.
  • [30] E. J. Bekkers, M. W. Lafarge, M. Veta, K. A. Eppenhof, J. P. Pluim, and R. Duits, “Roto-translation covariant convolutional networks for medical image analysis,” in International Conference on Medical Image Computing and Computer-Assisted Intervention.   Springer, 2018, pp. 440–448.
  • [31] Y. Zhou, Q. Ye, Q. Qiu, and J. Jiao, “Oriented response networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 519–528.
  • [32] D. E. Worrall, S. J. Garbin, D. Turmukhambetov, and G. J. Brostow, “Harmonic networks: Deep translation and rotation equivariance,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 5028–5037.
  • [33] M. Weiler and G. Cesa, “General E(2)-equivariant steerable cnns,” in Advances in Neural Information Processing Systems, 2019, pp. 14 334–14 345.
  • [34] student (, “Every measurable homomorphism from to is exponential.” Mathematics Stack Exchange, uRL: (version: 2013-07-13). [Online]. Available:
  • [35] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical image computing and computer-assisted intervention.   Springer, 2015, pp. 234–241.
  • [36] B. E. Bejnordi, M. Veta, P. J. Van Diest, B. Van Ginneken, N. Karssemeijer, G. Litjens, J. A. Van Der Laak, M. Hermsen, Q. F. Manson, M. Balkenhol et al., “Diagnostic assessment of deep learning algorithms for detection of lymph node metastases in women with breast cancer,” Jama, vol. 318, no. 22, pp. 2199–2210, 2017.
  • [37] K. Sirinukunwattana, J. P. Pluim, H. Chen, X. Qi, P.-A. Heng, Y. B. Guo, L. Y. Wang, B. J. Matuszewski, E. Bruni, U. Sanchez et al., “Gland segmentation in colon histology images: The glas challenge contest,” Medical image analysis, vol. 35, pp. 489–502, 2017.
  • [38] H. Larochelle, D. Erhan, A. Courville, J. Bergstra, and Y. Bengio, “An empirical evaluation of deep architectures on problems with many factors of variation,” in Proceedings of the 24th international conference on Machine learning, 2007, pp. 473–480.
  • [39] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 3431–3440.
  • [40] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask R-CNN,” ArXiv e-prints, p. arXiv:1703.06870, Mar. 2017.
  • [41] S. E. A. Raza, L. Cheung, M. Shaban, S. Graham, D. Epstein, S. Pelengaris, M. Khan, and N. M. Rajpoot, “Micro-net: A unified model for segmentation of various objects in microscopy images,” Medical image analysis, vol. 52, pp. 160–173, 2019.
  • [42] Y. Zhou, O. F. Onder, Q. Dou, E. Tsougenis, H. Chen, and P.-A. Heng, “Cia-net: Robust nuclei instance segmentation with contour-aware information aggregation,” arXiv preprint arXiv:1903.05358, 2019.
  • [43] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard et al., “Tensorflow: A system for large-scale machine learning.” in OSDI, vol. 16, 2016, pp. 265–283.