Analysis of cytoarchitectonic cortical areas in high-resolution histological images of brain sections is essential for identifying the segregation of the human brain into cortical areas and nuclei. Areas can be distinguished based on their specific architecture, the presence of cell clusters and specific cell types according to their morphology, the visibility of columns, and other features. Borders between cortical areas can be identified in a reproducible manner by a well-accepted method, which relies on image analysis and multivariate statistical tools to capture maximal changes in the distribution of cell bodies from the cortical surface to the white matter border. However, a completely automatic method that allows area segmentation in a large series of human brain sections is still missing.
Automatic identification of areas in cell-body stained whole brain sections is an extremely challenging segmentation task, given staining and sectioning artifacts, varying orientations of the sectioning plane with respect to the brain surface, and high inter-subject variability (Fig. 1b, 1c). To reliably identify differences in the distribution of cell bodies in the cortex, automatic methods need to rely on high resolutions (1–) and a large field of view (approx. ) at the same time. Since expert annotation of brain regions is very labor-intensive, the amount of training data available for automatic algorithms is limited.
Previously, we have shown that despite these limitations, it is possible to employ Convolutional Neural Networks (CNNs) for segmentation of cytoarchitectonic areas. However, this model is not yet accurate enough for fully automatic segmentation.
In this work, we introduce a way to bypass the limitation of labeled training data by exploiting unlabeled high-resolution cytoarchitectonic sections. We formulate a self-supervised auxiliary task based on the estimation of spatial distances between image patches sampled from the same brain, using a Siamese network architecture. In particular, we determine the approximate geodesic distance between two image patches by exploiting the inherent 3D structure of a whole brain 3D reconstruction (Fig. 1a), as provided, e.g., by the BigBrain.
We make the following contributions: 1) By applying transfer learning, we significantly improve the accuracy of area classification in the visual cortex. 2) Carefully examining the training objective, we show that the Siamese network gains significant performance when predicting absolute 3D coordinates in addition to the pairwise distances. 3) We show that the self-supervised model learns to identify anatomically plausible cytoarchitectonic borders, although it was never trained to develop a concept of brain areas.
2 Related Work
In contrast to these works, we aim to improve cytoarchitectonic mapping in microscopic scans of human whole brain cell-body stained tissue sections. A CNN method for automatically classifying cortical areas in such data was introduced by. Building on a U-Net architecture, their key insight is to include prior knowledge in the form of probabilistic atlas information to deal with the difficult variations in texture and limited training data. In this work, we improve upon their results by leveraging unlabeled data in a self-supervised approach.
Transfer Learning and Self-Supervised Learning.
In recent years, several studies have successfully used transfer learning to apply well-performing models trained on the ImageNet dataset to new tasks. For the task at hand, however, we require an unusually large receptive field and address a very specific data domain. This forces us to train a custom CNN architecture from scratch.
Siamese networks have been employed for learning highly nonlinear image features for keypoint matching, image retrieval, or object pose estimation [9, 5]. Leveraging spatial dependencies in input images or exploiting motion information contained in video, such features can be learned in a self-supervised manner [4, 11]. We take up this idea and leverage the 3D relationship between individual brain sections. In  a Siamese regression network for pose estimation is presented, combining targets for pose regression from a single image with prediction of distance in pose between two input images. Extending this approach, we include prediction of the 3D coordinates of input patches as an additional objective in our model, and explain the benefits gained by this modification.
Our aim is to improve the accuracy of the supervised area segmentation in . Since classical cytoarchitectonic mapping is an extremely time-consuming expert task, we cannot easily overcome the problem of limited training data. We therefore propose to exploit unlabeled brain sections from a 3D reconstructed whole-brain volume, which can be acquired in much larger amounts in reasonable time, for automatic brain mapping. Our method consists of two consecutive steps (Fig. 2): 1) Pre-train weights on a self-supervised task using a Siamese network. 2) Fine-tune from these weights on the area segmentation task using a small dataset with brain region labels.
3.1 Self-supervised Siamese Network on Auxiliary Distance Task
Considering a dataset of unlabeled brain sections from one human brain , we formulate the self-supervised feature learning task: Given two input patches sampled randomly from the cortex in arbitrary sections, learn to predict the geodesic distance along the surface of the brain between these two patches (Fig. 1a).
We use a Siamese network that computes a regression function based on two input patches (Fig. 2). The network consists of two branches with identical CNN architecture and shared weights, each computing a feature vector for its input. The branch architecture corresponds to the texture filtering branch of the extended U-Net architecture, with a 32-channel dense layer added on top of the last convolutional layer. We define the predicted distance as the squared Euclidean distance between the feature vectors, and the distance loss as:
The ground-truth distance is computed by finding the closest points of the inputs on the brain surface and calculating their shortest distance along this surface. With this formulation of the distance loss, the model learns a Euclidean feature embedding of the inputs with respect to the geodesic distance along the brain surface.
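The ground-truth computation described here can be approximated by a shortest-path search over the edges of a surface mesh. The following is a minimal sketch under stated assumptions, not the procedure actually used for the BigBrain mesh: it assumes the mesh is given as vertex coordinates plus triangles, snaps each patch location to its nearest vertex, and runs Dijkstra's algorithm over edge lengths. All function names are hypothetical.

```python
import heapq
import math


def build_edge_graph(vertices, triangles):
    """Turn a triangle mesh into an adjacency dict {vertex: {neighbor: edge length}}."""
    graph = {i: {} for i in range(len(vertices))}
    for a, b, c in triangles:
        for u, v in ((a, b), (b, c), (c, a)):
            length = math.dist(vertices[u], vertices[v])
            graph[u][v] = length
            graph[v][u] = length
    return graph


def nearest_vertex(vertices, point):
    """Index of the mesh vertex closest to an arbitrary 3D point."""
    return min(range(len(vertices)), key=lambda i: math.dist(vertices[i], point))


def surface_distance(graph, source, target):
    """Length of the shortest path along mesh edges (approximate geodesic)."""
    best = {source: 0.0}
    queue = [(0.0, source)]
    while queue:
        dist, node = heapq.heappop(queue)
        if node == target:
            return dist
        if dist > best.get(node, math.inf):
            continue  # stale queue entry
        for neighbor, length in graph[node].items():
            candidate = dist + length
            if candidate < best.get(neighbor, math.inf):
                best[neighbor] = candidate
                heapq.heappush(queue, (candidate, neighbor))
    return math.inf  # target not reachable from source


# Toy mesh (assumption for illustration): a flat unit square split into two triangles.
vertices = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0), (0.0, 1.0, 0.0)]
triangles = [(0, 1, 2), (0, 2, 3)]
graph = build_edge_graph(vertices, triangles)
start = nearest_vertex(vertices, (0.1, 0.9, 0.2))
end = nearest_vertex(vertices, (0.9, 0.1, -0.1))
d_gt = surface_distance(graph, start, end)
```

On this toy mesh the edge path between the two opposite corners has length 2, whereas the true geodesic across the square is the diagonal of length √2. Edge-based shortest paths therefore overestimate geodesic distances on coarse meshes, so a fine mesh or an exact geodesic algorithm would be preferable in practice.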
We have successfully trained models using loss (1); however, our experiments have shown that convergence is faster and the predicted distances are more accurate when we include the prediction of the 3D location of the inputs as an additional task during training. To this end, we add a dense layer that calculates the predicted coordinate for each input from its feature vector, and formulate the coordinate loss as follows:
Here, the target is the 3D location of the input on the inflated surface. By defining the target on the inflated surface, we ensure a high correlation between the distance of the coordinates and the geodesic distance of the inputs. For points on the right hemisphere, we reverse the left-right coordinate of the target to account for the essentially mirror-symmetric topology of areas on the two hemispheres. The coordinate loss helps the network to learn a good feature embedding, agglomerating spatially close samples, even though they do not necessarily appear together as a pair during training.
The final training loss is a weighted combination of the distance loss and the coordinate loss, together with an L2 weight regularization that applies to all weights and biases except those in the final dense layers:
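To make the interplay of the three terms concrete, the sketch below evaluates such a loss for a single pair of patches. It is an illustration only: the squared-error form of both the distance and the coordinate term is an assumption, as are all function and parameter names and the choice of index 0 as the left-right axis. Only the default values 10 (coordinate loss weight) and 0.001 (weight decay) are taken from the implementation details in Sec. 3.1.1.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def mirror_right_hemisphere(coord, is_right):
    """Flip the left-right axis (assumed to be index 0) for right-hemisphere points."""
    x, y, z = coord
    return (-x, y, z) if is_right else (x, y, z)


def siamese_loss(feat_a, feat_b, d_gt,
                 coord_pred_a, coord_pred_b, coord_gt_a, coord_gt_b,
                 regularized_weights, coord_weight=10.0, weight_decay=0.001):
    """Distance loss + weighted coordinate loss + L2 penalty for one pair.

    The squared-error penalties are an assumption of this sketch.
    """
    d_pred = squared_distance(feat_a, feat_b)      # predicted distance
    loss_dist = (d_pred - d_gt) ** 2               # assumed distance penalty
    loss_coord = (squared_distance(coord_pred_a, coord_gt_a)
                  + squared_distance(coord_pred_b, coord_gt_b))
    # Only weights outside the final dense layers are expected in this list.
    l2 = sum(w ** 2 for w in regularized_weights)
    return loss_dist + coord_weight * loss_coord + weight_decay * l2


# Toy pair: the first patch lies on the right hemisphere, so its target is mirrored.
target_a = mirror_right_hemisphere((-1.0, 2.0, 3.0), is_right=True)
target_b = mirror_right_hemisphere((0.0, 0.0, 0.0), is_right=False)
total = siamese_loss(
    feat_a=[1.0, 0.0], feat_b=[0.0, 0.0], d_gt=1.0,
    coord_pred_a=(1.0, 2.0, 3.0), coord_pred_b=(0.0, 0.0, 1.0),
    coord_gt_a=target_a, coord_gt_b=target_b,
    regularized_weights=[1.0, 2.0],
)
```

In this toy case the distance term vanishes, so the total is dominated by the coordinate error of the second patch, which illustrates how strongly a coordinate weight of 10 emphasizes the absolute location.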
3.1.1 Implementation Details
We generate our dataset from the BigBrain , a dataset of consecutive histological cell-body stained sections that were registered to a 3D volume at resolution. A surface mesh is available at resolution. We sample 200k 1019 patches at resolution from sections 0–3000 (occipital and parietal lobe, encompasses visual cortex), leaving out th of the sections for testing (cf. Fig. 1a for the sampling locations, Fig. 1b for example patches). To ensure that the laminar structure of the cortex is clearly visible, we only sample from the center of non-oblique cortex, i.e., where the cutting plane was degrees to the brain surface. From these samples we build 200k pairs in such a way that each patch occurs at least once, pairs always lie on the same hemisphere, and the resulting degree distribution of connections between pairs follows a power law. It would also be possible to include pairs across the hemispheres and calculate their distance by mapping points from the right hemisphere to the surface of the left hemisphere. We choose to stick to intrahemispheric distances due to interhemispheric differences in tissue size and area spread. We set the coordinate loss weight to 10 and the weight decay factor
to 0.001. Networks are trained for 16 epochs with SGD using an initial learning rate of 0.01, decaying by factor 2 every 3 epochs.
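The learning rate schedule stated above can be written down in a few lines. This sketch assumes 0-based epochs and that each halving takes effect at the start of epochs 3, 6, 9, and so on; the function name is hypothetical.

```python
def step_decay(initial_lr, epoch, factor=2.0, every=3):
    """Learning rate for a 0-based epoch, divided by `factor` every `every` epochs."""
    return initial_lr / factor ** (epoch // every)


# 16 epochs starting at 0.01, as stated in the text.
schedule = [step_decay(0.01, epoch) for epoch in range(16)]
```

Under these assumptions the rate is halved five times over the 16 epochs.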
3.2 Fine-tuning the Extended U-Net on the Area Segmentation Task
For the area segmentation we use the extended U-Net architecture proposed in . This model combines local image features extracted from high-resolution image patches with a topological prior given by a cytoarchitectonic probabilistic atlas (http://jubrain.fz-juelich.de). For the two input types, the model has two separate downsampling branches that are joined before the upsampling branch. We use the same dataset as described in , comprising 111 cell-body stained sections from 4 different brains, partially annotated with 13 visual areas using the observer-independent method. For training, 2025 patches with resolution were randomly extracted from the dataset.
We initialize the texture filtering branch with the weights from the Siamese distance regression network. Compared to , we reduce the overall learning rate, but train the atlas data branch with a higher learning rate to account for the different initializations of the branches. In detail, we first train for 8k iterations without the atlas data, followed by an additional 10k iterations including the atlas information (batch size 40). Initial learning rates for these phases were 0.05 and 0.025 (0.25 for the atlas data branch), with the learning rate decaying by a factor of 2 at iterations 3k, 5k, and 6k. Choosing a good learning rate was essential for the performance of the fine-tuning.
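The two-phase schedule with separate rates per branch could be organized as sketched below. Two points are assumptions of this sketch, since the text does not settle them: that the iteration counter restarts at the beginning of the second phase, and that the same decay milestones apply in both phases. The branch names and function names are hypothetical.

```python
from bisect import bisect_right

MILESTONES = (3000, 5000, 6000)  # iterations at which the rate is halved


def milestone_decay(initial_lr, iteration, milestones=MILESTONES, factor=2.0):
    """Divide the rate by `factor` once for every milestone already reached."""
    return initial_lr / factor ** bisect_right(milestones, iteration)


def finetune_rates(iteration, phase):
    """Per-branch learning rates; phase 1 trains without the atlas branch."""
    if phase == 1:
        return {"texture": milestone_decay(0.05, iteration)}
    return {
        "texture": milestone_decay(0.025, iteration),
        "atlas": milestone_decay(0.25, iteration),  # higher rate for the fresh branch
    }
```

Keeping the atlas branch at ten times the rate of the pre-trained branch reflects the idea stated above: randomly initialized weights need larger updates than weights that already encode useful features.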
We investigate the benefit of transfer learning from the self-supervised network on the original task of classifying brain areas. In particular, we show the influence of the different loss components. Furthermore, we demonstrate that the self-supervised network can distinguish several cytoarchitectonic areas without being explicitly trained on brain area classification. As evaluation metrics for the area segmentation task, we report both the Dice score (harmonic mean of precision and recall) and the pixel distance error, which assigns to each misclassified pixel a penalty depending on the distance to the nearest pixel that is of the misclassified class [12, 10]. For the self-supervised distance task, the mean difference between the predicted and the ground-truth distances is reported.
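Both segmentation metrics can be illustrated on a small label map. The sketch below is not the evaluation code behind the reported numbers: the Dice function follows the standard definition, while the distance error simply averages, over all misclassified pixels, the Euclidean distance to the nearest ground-truth pixel of the predicted class. That averaging is an assumption, and the measure in [12] may normalize differently. Function names are hypothetical.

```python
import math


def dice_score(pred, truth, label):
    """Dice for one class: 2*TP / (2*TP + FP + FN), i.e. the harmonic mean of precision and recall."""
    pairs = [(p, t) for prow, trow in zip(pred, truth) for p, t in zip(prow, trow)]
    tp = sum(p == label and t == label for p, t in pairs)
    fp = sum(p == label and t != label for p, t in pairs)
    fn = sum(p != label and t == label for p, t in pairs)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 1.0


def mean_pixel_distance_error(pred, truth):
    """Mean distance from each misclassified pixel to the nearest true pixel of its predicted class."""
    positions = {}
    for r, row in enumerate(truth):
        for c, t in enumerate(row):
            positions.setdefault(t, []).append((r, c))
    penalties = []
    for r, row in enumerate(pred):
        for c, p in enumerate(row):
            # Predicted classes that never occur in the ground truth are skipped.
            if p != truth[r][c] and p in positions:
                penalties.append(min(math.dist((r, c), q) for q in positions[p]))
    return sum(penalties) / len(penalties) if penalties else 0.0


# Toy example with two areas; two pixels of area 2 are mislabeled as area 1.
truth = [[1, 1, 2, 2],
         [1, 1, 2, 2]]
pred = [[1, 1, 1, 2],
        [1, 1, 2, 1]]
dice_area_1 = dice_score(pred, truth, 1)
error = mean_pixel_distance_error(pred, truth)
```

The two mislabeled pixels contribute equally to the Dice score, but the one farther from the true area 1 receives the larger distance penalty. This is why the distance error is sensitive to confusions between spatially distant areas.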
Siamese Network Loss.
The loss function defined in Eq. (3) for the self-supervised network combines a distance loss with a coordinate loss. In order to evaluate the influence of each loss component, we trained a self-supervised model on 10% of the training set (20k samples) for 50 epochs. The model performs best when combining the distance loss with the coordinate loss (Fig. 5, rows 2-4 of the table). The inclusion of the coordinate loss doubles the performance on the distance task, showing that it has the expected effect of guiding the model towards a more representative feature embedding. When training only on the coordinate loss, the performance of the fine-tuned network is almost as good as when training on the combined loss. However, the pixel distance error is then larger. A possible intuitive explanation is that the distance loss allows the model to see more realistic relationships between samples of the cortex than the coordinate loss, where distances between coordinates only approximately represent the geodesic distance. Thus the combined loss enables the model to better allocate individual samples in the feature embedding and make fewer errors on the area segmentation task. Training on the full dataset moderately increases performance on the area segmentation task.
Fine-tuned Area Segmentation Model. Compared to the randomly initialized network in , the Dice score increases to 0.80, while the pixel distance error drops from 21.2 to 14.4. The drastic reduction of the pixel distance error indicates that, due to the pre-training, the model can locate patches more accurately in the brain and confuses spatially distant areas less often. Thus, the effect of pre-training on the distance regression task is similar to that of including the topological atlas prior in the supervised network. The confusion matrices (Fig. 5) and example segmentations (Fig. 6) reveal that the fine-tuned model predicts more areas reliably, and overall exhibits less noise in the segmentation.
Self-supervised Learning of Primary Visual Cortex. To better understand the feature embedding that the self-supervised network learns, we average blocks of nine neighboring feature vectors (each apart) and plot the squared Euclidean distances between neighboring averaged feature vectors. This way we can appreciate the differences that the model predicts between neighboring regions. There are three main factors that cause the model to see large differences between neighboring parts of the cortex: 1) Oblique parts of the cortex, 2) regions with high curvature, and 3) borders between cortical brain areas. The latter is particularly exciting: It shows us that the model actually discovered relevant properties of some cytoarchitectonic regions to solve the distance regression. In Fig. 6 we show that the network has correctly identified the border between hOc1/hOc2 and tracks it through several sections.
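The block-averaging analysis can be sketched as follows. The sketch assumes that the feature vectors are ordered along the cortical ribbon within a section and that the blocks of nine are consecutive and non-overlapping; the original analysis may use a different neighborhood layout. Function names and the toy data are hypothetical.

```python
def block_means(features, block_size=9):
    """Average consecutive, non-overlapping blocks of feature vectors."""
    means = []
    for start in range(0, len(features) - block_size + 1, block_size):
        block = features[start:start + block_size]
        means.append([sum(column) / block_size for column in zip(*block)])
    return means


def neighbor_distances(means):
    """Squared Euclidean distance between each averaged vector and the next one."""
    return [sum((a - b) ** 2 for a, b in zip(u, v)) for u, v in zip(means, means[1:])]


# Toy profile: 18 two-dimensional features with a jump after the ninth.
features = [[0.0, 0.0]] * 9 + [[3.0, 4.0]] * 9
profile = neighbor_distances(block_means(features))
```

A peak in such a profile marks a position where the embedding changes abruptly, which is the signature the text associates with oblique cortex, high curvature, or an areal border.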
5 Discussion and Conclusion
Exploiting prior knowledge and the inherent structure of the data is beneficial for tasks with limited training data. Our experiments show that the self-supervised distance task is a suitable auxiliary task for classifying cortical brain areas. It significantly increases the Dice score, a measure of segmentation quality. In our evaluation, we have shown the importance of both components of our loss function for learning a good feature embedding. Additionally, we have demonstrated that our self-supervised model, trained with only the distances between samples as training signal, learns to identify several areal borders.
Inspired by this success, we plan to evaluate more auxiliary tasks based on inherent and relevant structures of 3D human brain reconstructions, such as local curvature or the relative orientation of the sectioning plane with respect to the brain surface, and to further evaluate the unsupervised features and their applicability to identifying areal borders.
Acknowledgements. This work was partially supported by the Helmholtz Association through the Helmholtz Portfolio Theme “Supercomputing and Modeling for the Human Brain”, and by the European Union’s Horizon 2020 Framework Programme for Research and Innovation under Grant Agreement No. 720270 (Human Brain Project SGA1). Computing time was granted by the John von Neumann Institute for Computing (NIC) and provided on the supercomputer JURECA at Jülich Supercomputing Centre (JSC).
[1] Amunts, K., Lepage, C., Borgeat, L., Mohlberg, H., … Shah, N. J.: BigBrain: an ultrahigh-resolution 3D human brain model. Science, 1472–1475 (2013)
[2] Amunts, K., Zilles, K.: Architectonic mapping of the human brain beyond Brodmann. Neuron, 1086–1107 (2015)
[3] de Brébisson, A., Montana, G.: Deep neural networks for anatomical brain segmentation. CVPRW, 20–28 (2015)
[4] Doersch, C., Gupta, A., Efros, A. A.: Unsupervised visual representation learning by context prediction. ICCV, 1422–1430 (2015)
[5] Doumanoglou, A., Balntas, V., Kouskouridas, R., Kim, T.: Siamese regression networks with efficient mid-level feature extraction for 3D object pose estimation. CoRR, abs/1607.02257 (2016)
[6] Glasser, M. F., Coalson, T. S., Robinson, E. C., Hacker, C. D., … Smith, S. M.: A multi-modal parcellation of human cerebral cortex. Nature, 171–178 (2016)
[7] Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmentation. MICCAI, 234–241 (2015)
[8] Schleicher, A., Amunts, K., Geyer, S., Morosan, P., Zilles, K.: Observer-independent method for microstructural parcellation of cerebral cortex: a quantitative approach to cytoarchitectonics. Neuroimage, 165–177 (1999)
[9] Simo-Serra, E., Trulls, E., Ferraz, L., … Moreno-Noguer, F.: Discriminative learning of deep convolutional feature point descriptors. ICCV, 118–126 (2015)
[10] Spitzer, H., Amunts, K., Harmeling, S., Dickscheid, T.: Parcellation of visual cortex on high-resolution histological brain sections using convolutional neural networks. ISBI, 920–923 (2017)
[11] Wang, X., Gupta, A.: Unsupervised learning of visual representations using videos. ICCV (2015)
[12] Yasnoff, W., Mui, J., Bacus, J.: Error measures for scene segmentation. Pattern Recognition, 217–231 (1977)
[13] Yosinski, J., Clune, J., Bengio, Y., Lipson, H.: How transferable are features in deep neural networks? NIPS, 3320–3328 (2014)