Rotational 3D Texture Classification Using Group Equivariant CNNs

10/16/2018
by   Vincent Andrearczyk, et al.

Convolutional Neural Networks (CNNs) traditionally encode translation equivariance via the convolution operation. Generalizing to other transformations has recently attracted attention as a way of encoding knowledge of the data geometry into group convolution operations. Equivariance to rotation is particularly important for 3D image analysis due to the large diversity of possible pattern orientations. 3D texture is a particularly important cue for the analysis of medical images such as CT and MRI scans, as it describes different types of tissues and lesions. In this paper, we evaluate the use of 3D group equivariant CNNs that account for the simplified group of right-angle rotations to classify 3D synthetic textures from a publicly available dataset. The results validate the importance of rotation equivariance in a controlled setup, yet motivate a finer coverage of orientations in order to obtain equivariance to the realistic rotations present in 3D textures.

1 Introduction

CNNs naturally benefit from the translation equivariance of the convolution operation. They have been used in various studies to analyze static or dynamic textures by incorporating an orderless pooling of feature maps to discard the overall shape and layout and, thus, describe repetitive texture patterns [1, 2, 3]. Robustness to other groups of transformations, including rotation and reflection, has traditionally been approximated by showing the network various examples of transformed data, e.g. in the form of naturally large training sets or artificial data augmentation. The cost of learning such transformations has been mitigated by transfer learning, i.e. re-using weights pre-trained on a very large dataset, typically ImageNet [4] for 2D images. The features learned on such large datasets (e.g. colors and edges in early layers, more complex shapes in deeper ones) have proven to generalize well to various computer vision tasks [1, 2]. However, such models also require more Degrees of Freedom (DoF), and the type of invariance required to perform a given task may depend on the data and task. For instance, analyzing local patterns (e.g. tumor or organ walls, diverse vascular configurations) in medical images often requires local rotation invariance. On the other hand, the scale is generally controlled and can become an informative feature that we do not want to discard through invariance. Besides, the amount of data available in medical imaging is often limited and cannot include all possible 3D orientations of all patterns relevant to the analysis task at hand. It follows that controlling the types of equivariance built into a CNN is important for the speed of convergence and data efficiency.

Hard-coding equivariance into CNNs has gained strong interest in recent literature [5, 6, 7, 8, 9, 10, 11]. Group equivariant convolutions [5] (G-convolutions) exploit weight sharing across transformation channels, in particular right-angle rotations and roto-reflections, expanding the expressive capacity of the network with a limited increase in the number of parameters. The use of 2D steerable filters was also proposed [8, 12] to obtain finer rotation equivariance.

3D CNNs [13] stand to benefit greatly from built-in rotation equivariance due to the larger number of possible 3D rotations (i.e. three rotation axes), which must otherwise be learned via data augmentation and unnecessary additional DoFs. 3D G-CNNs were recently developed for pulmonary nodule detection in chest CT scans [9] and for 3D object recognition and volumetric boundary segmentation [10].

In this paper, we evaluate the importance of equivariance and invariance to rotations for classifying 3D textures using classic and group equivariant CNN designs. We employ a controlled experimental setup with synthetic texture volumes having characteristics similar to organ and tumor tissue in medical imaging.

2 Methods

2.1 Group Equivariant CNN

A convolution layer exploits the translation symmetry of grid data by sharing weights across spatial shifts, resulting in a translation equivariance that is maintained throughout the network. This equivariance is generally pooled into a translation invariance by the last pooling and fully connected layers. Here, the network $\Phi$ acts on a stack of $n$-dimensional grids, i.e. $n$-D images or feature maps ($n=3$ in this paper), and the transformation operator $L_t$ implements, in this case, an $n$-D translation. Yet, the weight sharing can be extended to larger transformation groups to exploit the symmetries of the data, including rotations and reflections. Standard convolutions and their group equivariant counterparts (G-convolutions) were formally defined by Cohen and Welling [5]. A G-convolution is equivariant to the translation group and to a symmetry group $\mathcal{S}$, together forming the larger group $G$. In this paper, $\mathcal{S}$ is the cube rotation group $O$ or the roto-reflection group $O_h$ (the subscript $h$ is not related to a transformation here). The G-convolution output contains $K \cdot |\mathcal{S}|$ feature maps, where $|\mathcal{S}|$ is the number of elements in the group $\mathcal{S}$ and $K$ is the number of filters. The filters are typically transformed by all elements of $\mathcal{S}$, and the results of their convolution with the input channels are organized as orientation channels. With this built-in equivariance, we have $\Phi(L_g f) = L_g \Phi(f)$ for $g \in G$, i.e. a transformation of the input by $g$ results in the same transformation of the feature maps. The orientation channels of the filters, however, undergo a permutation to maintain the equivariance throughout the network, as described in [9]. G-pooling allows pooling the activations over the transformation channels to obtain local or global invariance to the symmetry group $\mathcal{S}$. It was shown in [5] that premature invariance in early layers was not desirable in their experiments, yet Local Rotation Invariance (LRI) can be useful for texture analysis, as discussed in Section 4.
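To make the G-convolution concrete, the following is a minimal NumPy sketch (not the authors' implementation) of a first-layer, or lifting, G-convolution for $\mathcal{S} = O$: one filter is correlated with the input in all 24 right-angle orientations, producing 24 orientation channels, and a subsequent max G-pooling collapses them into an O-invariant map. Deeper G-convolution layers additionally permute the orientation channels of the filters, as described in [9]; that step is omitted here.

```python
import numpy as np
from scipy.ndimage import correlate

def rotations24(vol):
    """Enumerate the 24 right-angle rotations (cube rotation group O)
    of a 3D array by composing np.rot90 operations."""
    def quarter_turns(v, axes):
        for k in range(4):
            yield np.rot90(v, k, axes=axes)
    yield from quarter_turns(vol, (1, 2))                            # axis 0 up
    yield from quarter_turns(np.rot90(vol, 2, axes=(0, 2)), (1, 2))  # axis 0 down
    yield from quarter_turns(np.rot90(vol, 1, axes=(0, 2)), (0, 1))  # axis 0 -> axis 2
    yield from quarter_turns(np.rot90(vol, -1, axes=(0, 2)), (0, 1))
    yield from quarter_turns(np.rot90(vol, 1, axes=(0, 1)), (0, 2))  # axis 0 -> axis 1
    yield from quarter_turns(np.rot90(vol, -1, axes=(0, 1)), (0, 2))

def lifting_gconv(image, filt):
    """Correlate one filter with the input in all 24 orientations,
    yielding feature maps with an extra orientation axis."""
    return np.stack([correlate(image, f, mode='constant')
                     for f in rotations24(filt)])

def g_pool(maps):
    """Max G-pooling over the orientation axis -> invariance to O."""
    return maps.max(axis=0)

# Equivariance check: rotating the input permutes the orientation channels,
# so the G-pooled output is simply rotated along with the input.
rng = np.random.default_rng(0)
x, w = rng.standard_normal((16, 16, 16)), rng.standard_normal((3, 3, 3))
y = g_pool(lifting_gconv(x, w))
y_rot = g_pool(lifting_gconv(np.rot90(x, 1, axes=(1, 2)), w))
assert np.allclose(y_rot, np.rot90(y, 1, axes=(1, 2)))
```

The final assertion illustrates the equivariance relation $\Phi(L_g f) = L_g \Phi(f)$ in the invariant case: because all 24 orientations are enumerated, the max over orientation channels commutes with the rotation of the input.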

In the 3D case [9], we consider equivariance to the orientation-preserving rotations of the cube, i.e. the group $O$ with 24 elements, as well as to the full symmetry group of the cube (roto-reflections), i.e. the group $O_h$ with 48 elements. Note that in medical imaging, the pixel spacing in the $z$-direction may differ from the planar one, e.g. in CT and MRI volumes. In this case, one would consider rectangular cuboid rotations (the groups $D_4$ with 8 elements and $D_{4h}$ with 16 elements).
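These group sizes are easy to verify: every right-angle symmetry of the voxel grid is a signed permutation matrix. The following standalone snippet (an illustration, not from the paper) enumerates them and counts the cube and cuboid groups:

```python
import numpy as np
from itertools import permutations, product

def cube_symmetries(reflections=False):
    """All signed 3x3 permutation matrices: the 24 rotations of the cube
    (det = +1, group O), or all 48 symmetries including reflections (Oh)."""
    mats = []
    for perm in permutations(range(3)):
        for signs in product((1, -1), repeat=3):
            m = np.zeros((3, 3), dtype=int)
            for row, (col, s) in enumerate(zip(perm, signs)):
                m[row, col] = s
            if reflections or round(np.linalg.det(m)) == 1:
                mats.append(m)
    return mats

print(len(cube_symmetries()), len(cube_symmetries(True)))  # 24 48

# Cuboid case: keep only the symmetries that map the z-axis (axis 0) to
# itself up to sign, giving D4 (8 rotations) and D4h (16 with reflections).
d4 = [m for m in cube_symmetries() if abs(m[0, 0]) == 1]
d4h = [m for m in cube_symmetries(True) if abs(m[0, 0]) == 1]
print(len(d4), len(d4h))  # 8 16
```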

2.2 Dataset

Various medical image modalities, such as MRI and CT scans of most organs including the lung, liver and breast, are characterized by 3D textures. In this paper, we focus on synthetic data to obtain a controlled setup with representative texture classes and known transformations in the training and test data. The RFAI (Reconnaissance de Formes, Analyse d'Images) database [14] contains 3D synthetic texture volumes with characteristics similar to organ and tumor tissue in medical imaging. We evaluate the methods on the Fourier and Geometric datasets of the RFAI database, composed of 15 and 25 classes respectively, with 10 instances per class. The limited amount of data is representative of many medical imaging tasks due to patient privacy and the cost of data annotation. The Geometric dataset was constructed by randomly positioning geometric shapes such as spheres, cubes and ellipsoids, while the Fourier dataset exhibits more directional textures and was constructed from synthetic distributions in the Fourier domain. Each instance is a 3D image of $64^3$ voxels. Fig. 1 illustrates one volume instance of each class.

For each instance of the datasets, a rotated version is provided [14] to evaluate and compare the rotation invariance of the algorithms. Finally, we introduce another rotated test set, referred to as O-rotated, in which the rotations applied to the texture volumes are randomly drawn from the group $O$ of cube symmetries (i.e. right-angle rotations). We keep the same random rotations for all experiments for a fair comparison. The data is normalized to zero mean and unit variance.
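A possible reconstruction of the O-rotated set generation is sketched below. This is illustrative only: the seed and the stand-in volumes are hypothetical, and rotations24 is the enumeration from the sketch in Section 2.1.

```python
import numpy as np

rng = np.random.default_rng(42)  # hypothetical fixed seed, reused across experiments
volumes = [rng.standard_normal((64, 64, 64)) for _ in range(10)]  # stand-ins for RFAI volumes

def normalize(v):
    return (v - v.mean()) / v.std()  # zero mean, unit variance

# One rotation per volume, drawn once from the 24 elements of O and kept
# identical for all experiments (fair comparison across networks).
rot_indices = rng.integers(24, size=len(volumes))
o_rotated = [normalize(list(rotations24(v))[i])  # rotations24: see Sec. 2.1 sketch
             for v, i in zip(volumes, rot_indices)]
```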

Figure 1: The Fourier (left) and Geometric (right) RFAI datasets are composed of 15 and 25 classes respectively. Each instance is a 3D image of $64^3$ voxels.

We evaluate the methods with a k-fold Cross-Validation (CV), where the number of folds k is set separately for the Fourier and the Geometric datasets. The random splits are kept unchanged for all experiments. Results based on a leave-one-out (LOO) CV, as reported in [15, 14], should be similar to or better than the ones reported in this paper, as the LOO CV uses more training data for each fold. For each run, we train using the normal set ($k-1$ folds) and test either on the normal, O-rotate or rotate sets for the remaining fold. Data augmentation is not performed since we want to evaluate the contribution of group equivariance in the context of a small amount of controlled training data.
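The evaluation protocol could look like the following skeleton. The fold count and split seed are assumptions (the exact values are not given here); the point is that the splits are created once and reused for every network and test set.

```python
import numpy as np
from sklearn.model_selection import KFold

k = 5                                     # hypothetical fold count
labels = np.repeat(np.arange(15), 10)     # e.g. Fourier: 15 classes x 10 instances
splits = list(KFold(n_splits=k, shuffle=True, random_state=0).split(labels))

for fold, (train_idx, test_idx) in enumerate(splits):
    # Train on the normal volumes of the k-1 training folds (no augmentation),
    # then evaluate the held-out fold on each of the three test variants.
    for test_set in ("normal", "O-rotate", "rotate"):
        pass  # model training / evaluation would go here
```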

2.3 Network Architectures

As a baseline, we use a standard CNN without group equivariance, referred to as Z3-CNN and summarized in Fig. 2. The Z3-CNN is composed of three convolution layers with 48, 96 and 192 output feature maps respectively. The first and second convolutions are followed by max-pooling operations for down-sampling. We average the feature maps across spatial locations (global average pooling) after the third convolution layer in order to learn and densely pool repeated texture features [1]. A first dense layer (i.e. fully connected, FC) with 1024 neurons is used, as well as an output dense layer with $N_c$ neurons, $N_c$ being the number of classes. The convolutional and dense layers are activated by a ReLU, except for the output dense layer, which is activated by a softmax function.
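A minimal PyTorch sketch of this baseline follows. The kernel sizes and the single input channel are assumptions, since this summary does not specify them, and the softmax is folded into the training loss.

```python
import torch
import torch.nn as nn

class Z3CNN(nn.Module):
    """Baseline 3D CNN (no group equivariance): three conv layers
    (48, 96, 192 maps), max-pooling after the first two, global average
    pooling, FC-1024, then the output layer. Kernel size 3 is an assumption."""
    def __init__(self, n_classes):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(1, 48, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(48, 96, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(96, 192, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.fc = nn.Sequential(
            nn.Linear(192, 1024), nn.ReLU(),
            nn.Linear(1024, n_classes),   # softmax is applied in the loss
        )

    def forward(self, x):
        x = self.features(x)
        x = x.mean(dim=(2, 3, 4))         # global average pooling over space
        return self.fc(x)

model = Z3CNN(n_classes=15)
out = model(torch.randn(2, 1, 64, 64, 64))  # batch of two 64^3 volumes
```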

We then derive several group equivariant networks from the Z3-CNN baseline. After the global average pooling, we can directly connect the orientation-selective feature maps to the fully connected layers (O-CNN and Oh-CNN). We can also pool across the orientations with maximum or average G-pooling (the -max and -avg variants). We do not incorporate G-pooling in early layers as it has been shown to reduce performance [5]. In order to keep the number of free parameters roughly equal across networks (except for the G-CNNs without G-pooling, as reported in Table 1), we divide the number of filters in the convolution layers by $\sqrt{|\mathcal{S}|}$, with $|O| = 24$ and $|O_h| = 48$.
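The $\sqrt{|\mathcal{S}|}$ factor keeps the parameter count roughly constant: a G-convolution layer holds about $K_\ell \cdot K_{\ell-1} \cdot |\mathcal{S}| \cdot k^3$ weights, so dividing both layer widths by $\sqrt{|\mathcal{S}|}$ cancels the factor $|\mathcal{S}|$. For the widths above (the rounding used here is an assumption):

```python
from math import sqrt

base_filters = [48, 96, 192]  # Z3-CNN layer widths
for name, order in (("O", 24), ("Oh", 48)):
    print(name, [round(f / sqrt(order)) for f in base_filters])
# O  [10, 20, 39]  -> each filter expands into 24 orientation channels
# Oh [7, 14, 28]   -> 48 orientation channels
```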

Figure 2: The Z3-CNN architecture. For the G-CNNs, the architecture is similar, but the number of filters is divided by $\sqrt{|\mathcal{S}|}$ and the feature maps are extended to $|\mathcal{S}|$ orientation channels.

The weights are initialized with the Xavier method [16]. An Adam optimizer with standard hyperparameters ($\beta_1 = 0.9$, $\beta_2 = 0.999$ and $\epsilon = 10^{-8}$) is used to minimize the cross-entropy loss, with a batch size of 16.
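In PyTorch, this training configuration could be sketched as follows (the learning rate is not stated in this summary, so Adam's default is used as an assumption; Z3CNN is the sketch above):

```python
import torch
import torch.nn as nn

def xavier_init(m):
    # Xavier (Glorot) initialization for all conv and dense layers
    if isinstance(m, (nn.Conv3d, nn.Linear)):
        nn.init.xavier_uniform_(m.weight)
        nn.init.zeros_(m.bias)

model = Z3CNN(n_classes=15)        # from the architecture sketch above
model.apply(xavier_init)
optimizer = torch.optim.Adam(model.parameters(), betas=(0.9, 0.999), eps=1e-8)
criterion = nn.CrossEntropyLoss()  # applies log-softmax internally

# One training step on a dummy batch of 16 volumes
x, y = torch.randn(16, 1, 64, 64, 64), torch.randint(0, 15, (16,))
optimizer.zero_grad()
loss = criterion(model(x), y)
loss.backward()
optimizer.step()
```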

3 Experimental Results

The results are summarized in Table 1. For such a small dataset, shallow hand-crafted features tailored to rotation invariance [15] naturally perform better than complex 3D CNNs with equivariance only to the rotations in $O$ and $O_h$. The goal of these experiments is not to compare against such methods, as a 3D CNN with pre-trained weights and data augmentation would easily obtain near-perfect accuracy. The point is rather to evaluate the built-in equivariance and invariance of these networks. The confusion matrices of the Z3-CNN and the O-CNN-max on the Fourier O-rotate set are reported in Fig. 3.

Network (# param.)     | F. norm. | F. O-rot. | F. rot. | G. norm. | G. O-rot. | G. rot.
Z3-CNN (271,039)       | 94.0     | 70.0      | 62.7    | 88.8     | 86.4      | 77.6
O-CNN (670,054)        | 95.3     | 49.3      | 54.0    | 89.6     | 78.8      | –
O-CNN-max (199,014)    | 96.0     | 95.3      | 96.0    | 87.6     | –         | –
O-CNN-avg (199,014)    | 92.7     | 92.0      | 70.7    | 96.0     | 96.8      | –
Oh-CNN (1,020,549)     | 94.7     | 52.7      | 48.7    | 95.2     | 88.3      | 81.2
Oh-CNN-max (250,501)   | 75.4     | 95.2      | 96.8    | 88.3     | –         | –
Oh-CNN-avg (250,501)   | 89.7     | 88.8      | 70.2    | 92.4     | 92.0      | 86.4
Table 1: Results of the k-fold overall accuracy (%) on the Fourier (F.) and Geometric (G.) RFAI datasets. Chance-level accuracy is 6.7% and 4%, respectively. The test volumes are taken from either the normal (norm.), the O-rotate (O-rot.) or the rotate (rot.) sets.
Figure 3: Confusion matrices on the Fourier O-rotate set with the Z3-CNN (left) and the O-CNN-max (right).

4 Discussion

On the normal set, the best results are obtained with the G-CNNs using either no G-pooling or max G-pooling. It is worth noting that the G-CNNs have fewer DoFs to model the diversity of texture patterns sought by the network. This is due to maintaining approximately the same total number of free parameters by dividing the number of filters by $\sqrt{|\mathcal{S}|}$. In this way, we enforce the rotation equivariance by weight sharing and reduce the DoFs available for learning the patterns. Since rotation equivariance is unnecessary for the normal test set, it is remarkable that such high performance is obtained by the G-CNNs, showcasing their ability to leverage data symmetries by sparing the parameter budget for data diversity.

The Z3-CNN does not generalize well to the rotated datasets, yet it performs better on the O-rotate set than on the rotate one. In particular, since the Geometric dataset contains various simple shapes such as cubes, right-angle rotations result in less variation from the training set than random rotations. The results on the O-rotate set show that invariance to right-angle 3D rotations is obtained by the G-CNNs with max or average G-pooling, although the latter reduces the discriminability by averaging across the orientations. The G-CNN without G-pooling is not invariant and performs only slightly better than the Z3-CNN on the O-rotate set. These results show that, with limited training data, equivariance is not sufficient and the invariance must be encoded with G-pooling.

On the rotate set, the G-CNNs outperform the Z3-CNN, yet a significant drop in accuracy is observed compared to the O-rotate set, suggesting that the right-angle symmetries do not cover enough orientations to sufficiently describe the possible pattern orientations. This result motivates the further use of higher-order rotational symmetries and steerable filters, as previously proposed in 2D [8] and recently in 3D [11]. The poor generalization of the Z3-CNN on the rotate set can also be explained by the lack of variability in the directionality of the training textures (all volumes of a class have the same directionality in the normal set), resulting in very directionally selective filters. This precision is useful for the normal set, yet any small rotation of the textures results in a mismatch with the convolution filters. Based on the analysis of confusion matrices, we observed that textures with directionality benefit more from the rotation invariance than non-directional textures. To illustrate this, we report the confusion matrices of the Z3-CNN and the O-CNN-max on the Fourier O-rotate set (Fig. 3) and compare the accuracy on the different texture classes. Volume examples of the classes are shown in Fig. 1 with the corresponding class numbers (c1 to c15). Non-directional texture classes such as c6 and c15 are already correctly classified by the Z3-CNN, while highly directional textures such as c10, c11 and c12 benefit greatly from the invariance. The directionality of the textures of class c13 is aligned with two rotation axes (see Fig. 1). Therefore, the Z3-CNN is less affected by right-angle rotations of these volumes, as they are likely to result in the same texture directionality (7 out of 10 correctly classified).

The $O_h$ equivariance does not result in a significant increase of accuracy compared to the $O$-CNNs. This is due to the O-rotate and rotate sets being obtained with orientation-preserving rotations. For this task, the enforced equivariance to reflections is unnecessary and further reduces the number of filter patterns that the network can learn.

Combining Directional Sensitivity (DS) and LRI is an important aspect of 3D texture analysis in real medical data [15]. With the G-CNNs, DS is learned during training, which is made easier by the weight sharing across filter orientations. LRI can be obtained by performing the G-pooling before the global average pooling, with a local support equivalent to the effective receptive field of the neurons at the orientation pooling layer. We are currently investigating LRI to right-angle rotations at different scales using G-pooling at various depths. On the Fourier and Geometric datasets, enforcing LRI before the global average pooling with max G-pooling results in a small drop of accuracy (approximately 2%) on the normal and O-rotate sets.
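The distinction between local and global invariance lies only in the ordering of the two pooling steps, as the following small sketch on a G-CNN activation tensor shows (shapes are illustrative):

```python
import torch

# G-CNN activations: (batch, |S| orientations, channels, D, H, W)
f = torch.randn(2, 24, 10, 8, 8, 8)

# Global invariance: average over space first, then G-pool over orientations.
globally_invariant = f.mean(dim=(3, 4, 5)).max(dim=1).values   # (batch, channels)

# LRI: max G-pooling at every position (local support = the effective
# receptive field), followed by the global average pooling.
locally_invariant = f.max(dim=1).values.mean(dim=(2, 3, 4))    # (batch, channels)
```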

5 Conclusion

In this paper, we evaluated the use of G-CNNs for the classification of 3D synthetic textures representative of rotated texture patterns in medical imaging. We experimentally confirmed the importance of CNN designs with built-in equivariance, and invariance when using G-pooling, to the group $O$, which can easily be extended to other symmetry groups. The results showed that, on this simple dataset, equivariance to right-angle rotations is very useful yet not sufficient to analyze randomly rotated textures. This observation therefore motivates the use of higher-order rotational symmetries, as proposed in 2D CNNs with steerable filters [8, 12] and recently in 3D [11]. In 3D, the model size, the computational complexity of the 3D filters and feature maps, and the number of symmetries to encode are significantly larger than in 2D, resulting in a considerable bottleneck in terms of GPU resources.

Acknowledgements

This work was supported by the Swiss National Science Foundation (grants PZ00P2_154891 and 205320_179069).

References