Roto-Translation Covariant Convolutional Networks for Medical Image Analysis

by Erik J. Bekkers, et al.
TU Eindhoven

We propose a framework for rotation and translation covariant deep learning using SE(2) group convolutions. The group product of the special Euclidean motion group SE(2) describes how a concatenation of two roto-translations results in a net roto-translation. We encode this geometric structure into convolutional neural networks (CNNs) via SE(2) group convolutional layers, which fit into the standard 2D CNN framework and make it possible to deal generically with rotated input samples without the need for data augmentation. We introduce three layers: a lifting layer which lifts a 2D (vector valued) image to an SE(2)-image, i.e., 3D (vector valued) data whose domain is SE(2); a group convolution layer from and to an SE(2)-image; and a projection layer from an SE(2)-image to a 2D image. The lifting and group convolution layers are SE(2) covariant (the output roto-translates with the input). The final projection layer, a maximum intensity projection over rotations, makes the full CNN rotation invariant. We show with three different problems in histopathology, retinal imaging, and electron microscopy that the proposed group CNNs achieve state-of-the-art performance without data augmentation by rotation, and that they outperform standard CNNs that do rely on augmentation.



1 Introduction

In this work we generalize convolutional neural networks (CNNs) to group CNNs (G-CNNs) in which the data lives on position orientation space, and in which the convolution layers are defined in terms of representations of the special Euclidean motion group SE(2). In essence this means that we replace the convolutions (with translations of a kernel) by group convolutions (with roto-translations of a kernel). The advantage of the proposed approach compared to standard CNNs is that rotation covariance is encoded in the network design and does not have to be learned by the convolution kernels. For example, a feature that may appear in the data under several orientations does not have to be learned for each orientation, but only once. As a result, there is no need for data augmentation by rotation, and the kernel weights (that no longer need to learn rotation covariance) become available to increase the CNN's expressive capacity. Moreover, the proposed group convolution layers are compatible with standard CNN modules, allowing for easy integration in popular CNN designs.

A main objective of medical image analysis is to develop models that are invariant to the shape and appearance variability of the structures of interest, including their arbitrary orientations. Rotation-invariance is a desired property, which our G-CNN framework generically deals with. We show state-of-the-art results with improvement over standard 2D CNNs on three different medical imaging tasks: mitosis detection in histopathology images, vessel segmentation in retinal images and cell boundary segmentation in electron microscopy (EM).

1.1 Related work

In relation to other approaches that incorporate rotation invariance/covariance in the network design, such as harmonic networks [1], local transformation invariance learning [2], deep symmetry nets [3], scattering CNNs [4, 5], and warped convolutions [6], the group convolution approaches [7, 8, 9, 4, 5, 10] most naturally extend the standard CNNs by simply replacing the convolution operators.

In the work by Cohen & Welling [7] a comprehensive theoretical framework for G-CNNs is developed for discrete groups whose transformations stay on the pixel grid. In particular their focus was on the wallpaper groups p4 (group of translations + 90° rotations), for which a G-CNN approach was also developed by Dieleman et al. [8], and p4m (p4 + reflections). In their work it was convincingly demonstrated that including such symmetries, by replacing standard convolutions by group convolutions, substantially increases the network's performance without increasing the number of network variables. Although their theoretical G-CNN framework [7] holds for more general groups, their actual application scope was limited to discrete groups that stay on the pixel grid. In this paper, we are not restricted to such groups, but include efficient bi-linear interpolation that allows us to employ the full structure of the continuous roto-translation group SE(2), which we can discretize to the sub-group SE(2,N), with N rotations. Special cases of our framework are standard 2D CNNs when N = 1 and the G-CNNs as proposed in [7, 8] when N = 4.

In very recent work, Weiler et al. [9] describe a different approach to G-CNNs. Instead of relying on interpolation they used 2D complex-valued steerable kernels, which has the advantage that kernel rotations are exact. A disadvantage is, however, that these kernels are constrained to a specific combination of complex valued basis functions. With our interpolation approach, kernel rotation simply appears in the CNN architecture as a (sparse) matrix-vector multiplication, that maps a set of base weights to a full set of rotated kernels.

In work by Mallat, Oyallon, and Sifre [5, 4] roto-translation invariant deep networks are formulated in the context of scattering theory. Their design involves a concatenation of separable group convolutions with hand-crafted (but well underpinned) filters, followed by the modulus as activation function. Learning takes place via support vector machines on the generated invariant descriptors. In our approach, the filters are learned without restrictions, the convolutions do not have to be separable, and we here use the common ReLU activation function.

In work by Bekkers et al. [10], an effective template matching method was proposed using group correlations in orientation scores, which are images obtained from a 2D image via lifting convolutions with a specific choice of kernel [11]. The templates were put in a B-spline basis (allowing for exact kernel rotations) and optimized via logistic regression. Their architecture fits within our framework as a single channel G-CNN of depth 2 with a fixed lifting kernel.

2 SE(2) convolutional neural networks

2.1 Group theoretical preliminaries

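The Lie group $SE(2)$: The group $SE(2) = \mathbb{R}^2 \rtimes SO(2)$ is the semi-direct product of the group of planar translations $\mathbb{R}^2$ and rotations $SO(2)$, and its group product is given by

$$g \cdot g' = (\mathbf{x}, R_\theta) \cdot (\mathbf{x}', R_{\theta'}) = (R_\theta \mathbf{x}' + \mathbf{x}, R_{\theta + \theta'}), \tag{1}$$

with group elements $g = (\mathbf{x}, R_\theta)$, $g' = (\mathbf{x}', R_{\theta'}) \in SE(2)$, with translations $\mathbf{x}, \mathbf{x}' \in \mathbb{R}^2$ and planar rotations by $\theta, \theta' \in [0, 2\pi)$. The group acts on the space of positions and orientations $\mathbb{R}^2 \times S^1$ via $g \cdot (\mathbf{x}', \theta') = (R_\theta \mathbf{x}' + \mathbf{x}, \theta + \theta')$. Since this action has the same form as the group product (1), we can identify the group $SE(2)$ with the space of positions and orientations $\mathbb{R}^2 \times S^1$. As such we will often write $g = (\mathbf{x}, \theta)$, instead of $(\mathbf{x}, R_\theta)$. Note that $g^{-1} = (-R_\theta^{-1} \mathbf{x}, -\theta)$ since $g \cdot g^{-1} = g^{-1} \cdot g = (\mathbf{0}, 0)$.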
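The group product and inverse can be checked numerically. The following is a minimal sketch, not taken from the authors' code: the helper names `rot`, `se2_product`, `se2_inverse` and `se2_close` are hypothetical, and group elements are represented as `(x, theta)` pairs with angles taken modulo 2π.

```python
import numpy as np

def rot(theta):
    """Planar rotation matrix R_theta."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def se2_product(g, h):
    """g . h = (R_theta x' + x, theta + theta') for g = (x, theta), h = (x', theta')."""
    (x, theta), (xp, thetap) = g, h
    return rot(theta) @ xp + x, (theta + thetap) % (2 * np.pi)

def se2_inverse(g):
    """g^{-1} = (-R_theta^{-1} x, -theta)."""
    x, theta = g
    return -rot(-theta) @ x, (-theta) % (2 * np.pi)

def se2_close(g, h):
    """Compare two elements, treating angles modulo 2*pi."""
    return np.allclose(g[0], h[0]) and np.isclose(np.exp(1j * g[1]), np.exp(1j * h[1]))

# Two arbitrary example elements (illustrative values only).
g = (np.array([1.0, 2.0]), np.pi / 3)
h = (np.array([-0.5, 4.0]), np.pi / 4)
identity = (np.zeros(2), 0.0)

print(se2_close(se2_product(g, se2_inverse(g)), identity))  # g . g^{-1} is the identity
print(se2_close(se2_product(g, h), se2_product(h, g)))      # roto-translations do not commute
```

The second check illustrates why the product is a semi-direct rather than a direct product: the rotation part of the first element acts on the translation part of the second.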

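Group representations: The structure of the group can be mapped to other mathematical objects (such as 2D images) via representations. The left-regular representation of $SE(2)$ on 2D images $f \in \mathbb{L}_2(\mathbb{R}^2)$ is given by

$$(\mathcal{U}_g f)(\mathbf{x}') = f(R_\theta^{-1}(\mathbf{x}' - \mathbf{x})), \tag{2}$$

with $g = (\mathbf{x}, \theta) \in SE(2)$, $\mathbf{x}' \in \mathbb{R}^2$. It corresponds to a roto-translation of the image. The left-regular representation on functions $F \in \mathbb{L}_2(SE(2))$ on $SE(2)$, which we refer to as $SE(2)$-images, is given by

$$(\mathcal{L}_g F)(g') = F(g^{-1} \cdot g') = F(R_\theta^{-1}(\mathbf{x}' - \mathbf{x}), \theta' - \theta), \tag{3}$$

with $g = (\mathbf{x}, \theta)$, $g' = (\mathbf{x}', \theta') \in SE(2)$. It is a shift-twist (rotation + $\theta$-shift) of $F$, see e.g. Fig. 1. Next we define the G-CNN layers in terms of these representations.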

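2.2 The SE(2) group convolution layers

In CNNs one can take a convolution or a cross-correlation viewpoint, and since these operators simply relate via a kernel reflection, the terminology is often used interchangeably. We take the second viewpoint: our G-CNNs are implemented using cross-correlations. On $\mathbb{R}^2$ we define cross-correlation via inner products of translated kernels:

$$(k \star_{\mathbb{R}^2} f)(\mathbf{x}) := (\mathcal{T}_{\mathbf{x}} k, f)_{\mathbb{L}_2(\mathbb{R}^2)} := \int_{\mathbb{R}^2} k(\mathbf{x}' - \mathbf{x}) f(\mathbf{x}') \, \mathrm{d}\mathbf{x}', \tag{4}$$

with $\mathcal{T}_{\mathbf{x}}$ the translation operator, the left-regular representation of the translation group $(\mathbb{R}^2, +)$. In the $SE(2)$ lifting layer we now simply replace translations of $k$ by roto-translations via the $SE(2)$ representation $\mathcal{U}_g$ defined in Eq. (2).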

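The lifting layer: Let $f, k : \mathbb{R}^2 \to \mathbb{R}^{N_c}$ be a vector valued 2D image and kernel (with $N_c$ channels), with $f = (f_1, \dots, f_{N_c})$ and $k = (k_1, \dots, k_{N_c})$, then the group lifting correlations for vector valued images are defined by

$$(k \,\tilde{\star}\, f)(g) := \sum_{c=1}^{N_c} (\mathcal{U}_g k_c, f_c)_{\mathbb{L}_2(\mathbb{R}^2)} = \sum_{c=1}^{N_c} \int_{\mathbb{R}^2} k_c(R_\theta^{-1}(\mathbf{x}' - \mathbf{x})) f_c(\mathbf{x}') \, \mathrm{d}\mathbf{x}'. \tag{5}$$

These correlations lift 2D image data to data that lives on the 3D position orientation space $\mathbb{R}^2 \times S^1 \equiv SE(2)$. The lifting layer that maps from a vector image $f^{(l-1)} : \mathbb{R}^2 \to \mathbb{R}^{N_{l-1}}$, with $N_{l-1}$ channels at layer $l-1$, to an $SE(2)$ vector image $F^{(l)} : SE(2) \to \mathbb{R}^{N_l}$ using a set of $N_l$ kernels $k^{(l)} := (k^{(l)}_1, \dots, k^{(l)}_{N_l})$, each with $N_{l-1}$ channels, is then defined by

$$F^{(l)} = k^{(l)} \,\tilde{\star}\, f^{(l-1)} := \left( k^{(l)}_1 \,\tilde{\star}\, f^{(l-1)}, \dots, k^{(l)}_{N_l} \,\tilde{\star}\, f^{(l-1)} \right). \tag{6}$$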
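As an illustration of the lifting correlation and its covariance, the sketch below restricts itself to four orientations (multiples of 90°), for which kernel rotations are exact on the pixel grid and no interpolation is needed. It is a toy NumPy example under those assumptions, not the authors' implementation; `correlate2d_valid` and `lift_p4` are hypothetical helper names, and a single output kernel is used.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def correlate2d_valid(f, k):
    """Channel-summed 'valid' cross-correlation; f: (C, H, W), k: (C, s, s)."""
    windows = sliding_window_view(f, k.shape[1:], axis=(1, 2))  # (C, H-s+1, W-s+1, s, s)
    return np.einsum("cabuv,cuv->ab", windows, k)

def lift_p4(f, k):
    """Lifting correlation with 4 exact kernel rotations: output has shape (4, H', W')."""
    return np.stack([correlate2d_valid(f, np.rot90(k, i, axes=(1, 2))) for i in range(4)])

rng = np.random.default_rng(0)
f = rng.normal(size=(3, 12, 12))   # toy 3-channel image
k = rng.normal(size=(3, 5, 5))     # toy 3-channel kernel

F = lift_p4(f, k)                                   # responses per orientation
F_rot = lift_p4(np.rot90(f, axes=(1, 2)), k)        # same kernel, input rotated by 90 degrees

# Covariance: the output is rotated spatially and shifted along the orientation axis.
expected = np.rot90(np.roll(F, 1, axis=0), axes=(1, 2))
print(np.allclose(F_rot, expected))
```

Rotating the input therefore does not change which responses are produced, only where they appear in the position orientation volume, which is the shift-twist behaviour described for SE(2)-images.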


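The group convolution layer: Let $F, K : SE(2) \to \mathbb{R}^{N_c}$ be a vector valued $SE(2)$ image and kernel, with $F = (F_1, \dots, F_{N_c})$ and $K = (K_1, \dots, K_{N_c})$, then the group correlations are defined as

$$(K \star F)(g) := \sum_{c=1}^{N_c} (\mathcal{L}_g K_c, F_c)_{\mathbb{L}_2(SE(2))} = \sum_{c=1}^{N_c} \int_{SE(2)} K_c(g^{-1} \cdot g') F_c(g') \, \mathrm{d}g', \tag{7}$$

with $(\cdot, \cdot)_{\mathbb{L}_2(SE(2))}$ the inner product on $\mathbb{L}_2(SE(2))$. A set of $SE(2)$ kernels $K^{(l)} := (K^{(l)}_1, \dots, K^{(l)}_{N_l})$ defines a group convolution layer, mapping from $F^{(l-1)}$ with $N_{l-1}$ channels to $F^{(l)}$ with $N_l$ channels, via

$$F^{(l)} = K^{(l)} \star F^{(l-1)} := \left( K^{(l)}_1 \star F^{(l-1)}, \dots, K^{(l)}_{N_l} \star F^{(l-1)} \right). \tag{8}$$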
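The group correlation can be sketched in the same simplified setting of four orientations, where the shift-twist of a kernel is an exact 90° rotation combined with a cyclic shift along the orientation axis. This is a toy example under that assumption with a single output kernel; `shift_twist` and `group_correlate` are hypothetical names, not the authors' API.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

N = 4  # sampled orientations in 90-degree steps, so rotations are exact

def shift_twist(A, i):
    """Rotate A of shape (C, N, H, W) by i * 90 degrees: cyclic orientation shift plus spatial rotation."""
    return np.rot90(np.roll(A, i, axis=1), i, axes=(2, 3))

def group_correlate(F, K):
    """Group correlation; F: (C, N, H, W), K: (C, N, s, s) -> (N, H-s+1, W-s+1)."""
    windows = sliding_window_view(F, K.shape[2:], axis=(2, 3))  # (C, N, H', W', s, s)
    return np.stack([
        np.einsum("cjabuv,cjuv->ab", windows, shift_twist(K, i))  # inner product with rotated kernel
        for i in range(N)
    ])

rng = np.random.default_rng(1)
F_in = rng.normal(size=(2, N, 10, 10))   # toy 2-channel SE(2,4)-image
K = rng.normal(size=(2, N, 3, 3))        # toy 2-channel SE(2,4) kernel

out = group_correlate(F_in, K)
out_rot = group_correlate(shift_twist(F_in, 1), K)

# Covariance: a shift-twisted input gives a shift-twisted output.
print(np.allclose(out_rot, np.rot90(np.roll(out, 1, axis=0), axes=(1, 2))))
```

Unlike the lifting layer, the kernel here has an orientation axis of its own, so each output orientation combines responses from all input orientations.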


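The projection layer: Projects a multi-channel $SE(2)$ image back to $\mathbb{R}^2$ via a maximum intensity projection over the orientations:

$$f^{(l)}(\mathbf{x}) = \max_{\theta \in [0, 2\pi)} F^{(l-1)}(\mathbf{x}, \theta). \tag{9}$$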

Figure 1: Rotation co- and invariance. Top row: the activations after the lifting convolutions with a single kernel $k$, which stacked together yield an SE(2)-image (cf. Eq. (6)). The projection layer at the end of the pipeline gives a rotation invariant feature vector. Bottom row: the same figures with a rotated input.
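The effect of the maximum intensity projection over rotations can be illustrated on a toy array. The sketch below assumes four orientations and a random array standing in for an SE(2)-image; `project` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(0)
F = rng.normal(size=(4, 9, 9))   # toy SE(2,4)-image: (orientations, height, width)

def project(F):
    """Maximum intensity projection over the orientation axis."""
    return F.max(axis=0)

# Shift-twist by 90 degrees: cyclic orientation shift plus spatial rotation.
F_rot = np.rot90(np.roll(F, 1, axis=0), axes=(1, 2))

print(np.allclose(project(F_rot), np.rot90(project(F))))    # the 2D projection rotates with the input
print(np.isclose(project(F_rot).max(), project(F).max()))   # a global maximum does not change
```

The projection discards the orientation shift, so the remaining dependence on rotation is purely spatial; any subsequent reduction over space (here a global maximum) yields a rotation invariant value.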

2.3 Discretization and network architecture

Discretization, kernel sizes and rotation: Discretized 2D images are supported on a bounded subset of $\mathbb{R}^2$ and the kernels live on a spatially rectangular grid of size $n \times n$ in $\mathbb{R}^2$, with $n$ the kernel size. We discretize the Lie group $SE(2)$ to $SE(2,N)$, with the space of 2D rotations in $SO(2)$ sampled with $N$ rotation angles $\theta_i = 2\pi i / N$, with $i = 0, \dots, N-1$. The discrete lifting kernels $k^{(l)}$ at layer $l$, mapping from a 2D image with $N_{l-1}$ input channels to an $SE(2,N)$ image with $N_l$ channels, thus have a shape of $n \times n \times N_{l-1} \times N_l$. The $SE(2,N)$ kernels $K^{(l)}$ have a shape of $n \times n \times N \times N_{l-1} \times N_l$. A complete set of $N$ rotations of kernels $k^{(l)}$ or $K^{(l)}$ can be constructed with a single matrix multiplication from a vector that contains the shared kernel weights. This matrix is sparse and encodes bi-linear interpolation and kernel rotation.
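A minimal sketch of such a rotation matrix for a single 2D kernel is given below. It is built densely with NumPy for readability (a sparse format would be used in practice), and it assumes uniformly spaced angles 2πi/N, rotation about the kernel centre, and zero weight for samples that fall outside the kernel grid. The function name `rotation_matrix_bilinear` is hypothetical and this is not the authors' implementation.

```python
import numpy as np

def rotation_matrix_bilinear(size, n_rot):
    """Matrix of shape (n_rot*size*size, size*size) mapping base weights to all rotated kernels."""
    c = (size - 1) / 2.0                          # kernel centre in array coordinates
    M = np.zeros((n_rot, size, size, size, size))
    for i in range(n_rot):
        theta = 2 * np.pi * i / n_rot
        cos, sin = np.cos(theta), np.sin(theta)
        for r in range(size):
            for q in range(size):
                yr, yq = r - c, q - c
                # Sample position R_theta^{-1} y, expressed in array coordinates.
                pr = cos * yr + sin * yq + c
                pq = -sin * yr + cos * yq + c
                r0, q0 = int(np.floor(pr)), int(np.floor(pq))
                fr, fq = pr - r0, pq - q0
                # Bi-linear weights over the four neighbouring grid points.
                for dr, wr in ((0, 1 - fr), (1, fr)):
                    for dq, wq in ((0, 1 - fq), (1, fq)):
                        rr, qq = r0 + dr, q0 + dq
                        if 0 <= rr < size and 0 <= qq < size:
                            M[i, r, q, rr, qq] += wr * wq
    return M.reshape(n_rot * size * size, size * size)

size, n_rot = 5, 8
M = rotation_matrix_bilinear(size, n_rot)
base = np.arange(size * size, dtype=float).reshape(size, size)   # toy base weights
kernels = (M @ base.ravel()).reshape(n_rot, size, size)          # all rotated kernels at once

print(np.allclose(kernels[0], base))             # angle 0 reproduces the base kernel
print(np.allclose(kernels[2], np.rot90(base)))   # angle pi/2 agrees with an exact 90-degree rotation
print(np.count_nonzero(M, axis=1).max())         # at most 4 non-zeros per row
```

Because each row holds at most four interpolation weights, the matrix stays sparse regardless of kernel size, and the same matrix can be reused for every channel since the weights depend only on the geometry.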

Layer                                             N=1          N=2          N=4          N=8          N=16
1 (lifting, Eq. (6))                              16 (1040)    13 (845)     10 (650)     8 (520)      6 (390)
2 (group conv., Eq. (8))                          16 (5408)    13 (7124)    10 (8420)    8 (10768)    6 (12108)
3 (group conv., Eq. (8))                          16 (5408)    13 (7124)    10 (8420)    8 (10768)    6 (12108)
4 (group conv., Eq. (8))                          64 (21632)   32 (17536)   16 (13472)   8 (10768)    4 (8072)
5 (group conv., Eq. (8) + projection, Eq. (9))    16 (1056)    16 (1056)    16 (1056)    16 (1056)    16 (1056)
6 (standard conv., output)                        1 (17)       1 (17)       1 (17)       1 (17)       1 (17)
Total weights                                     34561        33702        32035        33897        33751

Table 1: SE(2,N) chain settings for different orientation samplings N. Each entry gives the number of channels, with the number of weights in parentheses.
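The totals in Table 1 can be verified by summing the per-layer weight counts. The short check below uses only numbers copied from the table; the dictionary names are illustrative. It also confirms that the number of orientations times the number of channels in layer 4 is the same for every sampling.

```python
# Per-layer weight counts from Table 1 (layers 1 to 6), keyed by the number of orientations.
weights = {
    1:  [1040, 5408, 5408, 21632, 1056, 17],
    2:  [845, 7124, 7124, 17536, 1056, 17],
    4:  [650, 8420, 8420, 13472, 1056, 17],
    8:  [520, 10768, 10768, 10768, 1056, 17],
    16: [390, 12108, 12108, 8072, 1056, 17],
}
reported_totals = {1: 34561, 2: 33702, 4: 32035, 8: 33897, 16: 33751}
channels_layer4 = {1: 64, 2: 32, 4: 16, 8: 8, 16: 4}

for n, per_layer in weights.items():
    total = sum(per_layer)
    activations = n * channels_layer4[n]   # orientations times channels in layer 4
    print(n, total, total == reported_totals[n], activations)
```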

3 Experiments and Results

We consider three different tasks in three different modalities. In each we consider the SE(2,N) samplings with N = 1, 2, 4, 8, 16 to study the effect of the choice of N in the discretization. See Table 1 for the network settings. In each experiment the data is augmented at train and test time with transposed versions of the 2D input. For reference we also include transpose plus rotation augmentation for the N = 1 experiment (as in [12, 13]) in order to show that these are not necessary in our networks for sufficiently large N. Each experiment is repeated 3 times with random initialization and sampling to get a rough estimate of the mean and variance of the performance. For a fair comparison across different N, the overall number of weights is matched. For a fair comparison with the N = 1 approach, the number of "2D" activations (the number of orientations times the number of channels) in the last three layers is also matched. Each network optimizes a logistic loss using stochastic gradient descent with momentum using the same settings as in [12]. Our G-CNN implementations are publicly available. The results are given in Fig. 2; the tasks and metrics are summarized as follows.

Figure 2: Top row: Crop outs of images of the three tasks with the class probabilities generated by our method. Bottom row: Mean results (± std. dev.).

Histopathology - Mitosis detection: The task aims at detecting mitotic figures in hematoxylin-eosin stained slides. We used the public dataset AMIDA13 [14] that consists of high power field images from 23 breast cancer cases. Eight cases (458 mitoses) were used to train the networks with random batches of image patches, balanced between mitotic and hard negative figures. The receptive field was obtained by means of max-pooling operations in the first three layers. Sets of candidate detections were generated as in [13] after selection of an operating point on four validation cases (92 mitoses). We assessed an F-score for each model based on the 11 test cases (533 mitoses) in the conditions of [14].

Retina - Vessel segmentation: In this task the blood vessels in the retina are segmented. For validation we use the public DRIVE database [15], which consists of 40 retinal images with manual segmentations. The set is split into a training set of 20 images (of which we use 16 for training and 4 for validation) and a test set of 20 images. The G-CNNs produce a probability for the vessel and background class. Training is done with patches sampled per class per image. The output probabilities can be thresholded to create a binary segmentation, which can be used to quantify performance in terms of sensitivity and specificity. The area under the receiver operating characteristic (ROC) curve, in short AUC, summarizes these performances into a single value.
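To make the metrics concrete, the sketch below computes sensitivity and specificity at one threshold and the AUC for a handful of made-up pixel probabilities. The data are purely illustrative, the function names are hypothetical, and the AUC is computed through its standard rank interpretation (the probability that a randomly chosen vessel pixel receives a higher score than a randomly chosen background pixel), which is not necessarily how the reported values were obtained.

```python
import numpy as np

def sensitivity_specificity(prob, label, threshold):
    """Sensitivity and specificity of the binary segmentation prob >= threshold."""
    pred = prob >= threshold
    tp = np.sum(pred & (label == 1))
    fn = np.sum(~pred & (label == 1))
    tn = np.sum(~pred & (label == 0))
    fp = np.sum(pred & (label == 0))
    return tp / (tp + fn), tn / (tn + fp)

def roc_auc(prob, label):
    """AUC as the fraction of (vessel, background) pairs ranked correctly; ties count half."""
    pos, neg = prob[label == 1], prob[label == 0]
    diff = pos[:, None] - neg[None, :]
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size

# Toy vessel probabilities and ground truth labels (1 = vessel, 0 = background).
prob = np.array([0.9, 0.8, 0.35, 0.6, 0.2, 0.1])
label = np.array([1, 1, 1, 0, 0, 0])

sens, spec = sensitivity_specificity(prob, label, threshold=0.5)
auc = roc_auc(prob, label)
print(sens, spec, auc)
```

Sensitivity and specificity depend on the chosen threshold, whereas the AUC aggregates over all thresholds, which is why it serves as the single summary value.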

Electron microscopy - Cell boundary segmentation: This task consists of segmenting the boundaries of cells that are imaged with EM. We use the data and evaluation system of the ISBI EM segmentation challenge [16]. The data consists of 2 volumes (1 train, 1 test), each containing 30 consecutive images from a serial section transmission EM. Both the segmentation and the evaluation are done by treating the volumes as sequences of 2D slices. To increase receptive field size we include max pooling in the first 2 layers. Training is done with patches sampled per class per image. The main evaluation criterion for the challenge is the Rand score, which measures the similarity between clusterings/connected components [17]. The reported Rand score is the maximum score (for several thresholds) computed for the connected components obtained after thinning of the binary cell boundary segmentation, see [16] for more details.
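As background for the Rand score, the sketch below implements the classical Rand index of [17] for two labelings of the same pixels: the fraction of pixel pairs on which the two labelings agree (both place the pair in the same segment, or both place it in different segments). This is an illustration of the underlying idea only; the challenge evaluation in [16] uses its own variant and thresholding procedure, which is not reproduced here, and `rand_index` is a hypothetical helper name.

```python
import numpy as np

def rand_index(a, b):
    """Classical Rand index between two labelings of the same elements."""
    a = np.asarray(a).ravel()
    b = np.asarray(b).ravel()
    _, ai = np.unique(a, return_inverse=True)
    _, bi = np.unique(b, return_inverse=True)
    # Contingency table: table[i, j] counts elements with label i in a and label j in b.
    table = np.zeros((ai.max() + 1, bi.max() + 1), dtype=np.int64)
    np.add.at(table, (ai, bi), 1)

    def pairs(n):
        return n * (n - 1) // 2

    total = pairs(a.size)
    same_both = pairs(table).sum()             # pairs grouped together in both labelings
    same_a = pairs(table.sum(axis=1)).sum()    # pairs grouped together in a
    same_b = pairs(table.sum(axis=0)).sum()    # pairs grouped together in b
    return (total + 2 * same_both - same_a - same_b) / total

# Toy connected-component labels for four pixels.
print(rand_index([0, 0, 1, 1], [5, 5, 3, 3]))   # identical up to relabeling
print(rand_index([0, 0, 1, 1], [0, 0, 1, 2]))   # one segment split in two
```

The index only depends on how pixels are grouped, not on the label values, which makes it suitable for comparing connected components whose numbering is arbitrary.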

Results: In each experiment we see that the performance of the baseline with extra rotation augmentations is reached by the non-augmented G-CNNs for , and even is surpassed for . In the first two experiments we also observe that the variance on the output is reduced with increasing . Our results on the public datasets match or improve upon the state of the art with the following scores: F-score= for mitosis detection, AUC = for vessel segmentation, Rand = for cell boundary segmentation.

4 Discussion and Conclusions

We showed a consistent improvement of performances across three medical image analysis tasks when using G-CNNs compared to their corresponding CNN baselines. The reported results are in line with the benchmark of each dataset and the best performances were obtained for an orientation capacity , indicating the advantage of learning such rotation-invariant representations. We observed improved stability over the repeated experiments in mitosis detection and vessel segmentation for and , suggesting a regularization effect due to the increased weight sharing with increasing .

We conclude that it is beneficial to include group convolution layers in CNN network design, as this avoids the need for rotation augmentation and improves overall performance. In all three medical imaging problems we achieved state-of-the-art results with the same (basic) network design for each task. Based on these results we expect that our layers may lead to a further performance increase when embedded in more complex network designs, such as the popular UNets and ResNets.

Acknowledgements: The research leading to these results has received funding from the ERC council under the EC’s 7th Framework Programme (FP7/2007–2013) / ERC grant agr. No. 335555.


  • [1] Worrall, D.E., Garbin, S.J., Turmukhambetov, D., Brostow, G.J.: Harmonic networks: Deep translation and rotation equivariance. In: CVPR. (2017) 5028–5037
  • [2] Sohn, K., Lee, H.: Learning invariant representations with local transformations. In: CVPR, Omnipress (2012) 1339–1346
  • [3] Gens, R., Domingos, P.M.: Deep symmetry networks. In: Advances in neural information processing systems. (2014) 2537–2545
  • [4] Sifre, L., Mallat, S.: Rotation, scaling and deformation invariant scattering for texture discrimination. In: CVPR, IEEE (2013) 1233–1240
  • [5] Oyallon, E., Mallat, S., Sifre, L.: Generic deep networks with wavelet scattering. arXiv preprint arXiv:1312.5940 (2013)
  • [6] Henriques, J.F., Vedaldi, A.: Warped convolutions: Efficient invariance to spatial transformations. In: Int. Conf. on Machine Learning. (2017) 1461–1469
  • [7] Cohen, T., Welling, M.: Group equivariant convolutional networks. In: Int. Conf. on Machine Learning. (2016) 2990–2999
  • [8] Dieleman, S., De Fauw, J., Kavukcuoglu, K.: Exploiting cyclic symmetry in convolutional neural networks. arXiv preprint arXiv:1602.02660 (2016)
  • [9] Weiler, M., Hamprecht, F.A., Storath, M.: Learning steerable filters for rotation equivariant cnns. arXiv preprint arXiv:1711.07289 (2017)
  • [10] Bekkers, E.J., Loog, M., ter Haar Romeny, B.M., Duits, R.: Template matching via densities on the roto-translation group. IEEE tPAMI 40(2) (2018) 452–466
  • [11] Duits, R., Felsberg, M., Granlund, G.H., ter Haar Romeny, B.M.: Image analysis and reconstruction using a wavelet transform constructed from a reducible representation of the Euclidean motion group. IJCV 72(1) (2007) 79–102
  • [12] Lafarge, M.W., Pluim, J.P., Eppenhof, K.A., Moeskops, P., Veta, M.: Domain-adversarial neural networks to address the appearance variability of histopathology images. In: MICCAI-DLMIA 2017. Springer (2017) 83–91
  • [13] Cireşan, D.C., Giusti, A., et al.: Mitosis detection in breast cancer histology images with deep neural networks. In: MICCAI, Springer (2013) 411–418
  • [14] Veta, M., van Diest, P., Willems, S., et al.: Assessment of algorithms for mitosis detection in breast cancer histopathology images. MEDIA 20(1) (2015) 237–248
  • [15] Staal, J., Abràmoff, M.D., Niemeijer, M., et al.: Ridge-based vessel segmentation in color images of the retina. IEEE TMI 23(4) (2004) 501–509
  • [16] Arganda-Carreras, I., Turaga, S.C., et al.: Crowdsourcing the creation of image segmentation algorithms for connectomics. Front. in neuroanatomy 9 (2015) 142
  • [17] Rand, W.M.: Objective criteria for the evaluation of clustering methods. Journal of the American Statistical association 66(336) (1971) 846–850