I Introduction
The analysis of subcortical structures and pathological regions in brain Magnetic Resonance (MR) images is crucial in clinical diagnosis, treatment planning and postoperative assessment. Taking the hippocampus as an example, the segmentation of this subcortical structure in brain MR images has been employed to predict the progression of Alzheimer’s disease (AD). AD is the 6th leading cause of death in the United States, and an estimated 5.5 million Americans were living with AD in 2017 [1]. Besides the segmentation of brain anatomical structures, segmenting pathological regions, such as ischemic stroke lesions, is also invaluable in clinical decisions. Stroke is the 5th leading cause of death in the United States and kills more than 130,000 Americans each year [2]. The labeling difficulties stem from the irregularity of the stroke lesion shape and the unpredictability of its location, which make it difficult to model the shape and to acquire prior knowledge about the location.
Recently, deep learning techniques, such as Convolutional Neural Networks (CNN), have brought significant improvements in image labeling. The techniques have evolved from image classification to semantic segmentation. General image classification infers the category of the input image from the learned abstraction, i.e., it assigns one specific label to the whole image. A large body of research has been devoted to improving classification accuracy [3, 4, 5], and some algorithms can even approach or outperform human beings [6, 7]. In image (semantic) segmentation, a correct label has to be assigned to each pixel based on the learned features. Classification networks can help with the pixel label estimation by sliding the input patch across the image. Predicting the label of the center pixel from the content abstraction of the patch is a conventional and accurate approach [8, 9]. However, these patch-wise methods suffer from an expensive computation burden due to the dense prediction. To deal with this problem, the trick of shifting the input and interlacing the output was introduced in OverFeat [10], which applies convolution kernels directly on the whole image rather than on fixed-size patches. Other image-wise segmentation methods have recently been proposed based on the fully convolutional network (FCN) [11, 12], which transforms the fully connected layers of a pretrained classification network into convolutional layers.
Despite the progress of CNN in general image analysis, it is still challenging to apply these methods directly to brain MR image analysis, as these medical images are usually 3D volumes with poor contrast. To utilize CNN for 3D image analysis, the conventional way is to apply a 2D CNN to each image slice (axial plane) and then to concatenate the results along the third image direction. Directly applying 2D convolution to 3D volumes collapses the temporal information during the convolution process. To learn spatiotemporal features, 3D convolution has recently been introduced in video analysis tasks [13, 14]. Given the expensive computation cost, the size of the convolution kernels is usually set to a small number in practice, which can only capture short-term dependencies.
For image segmentation with CNN, the classic architecture is the fully convolutional network (FCN) [11]. Due to the large receptive fields and pooling layers, FCN tends to produce segmentations that are poorly localized around object boundaries. Therefore, the deep learning outcomes are usually combined with probabilistic graphical models to further refine the segmentation results. Fully connected CRF [15] is one commonly used graphical model for FCN postprocessing [12], where each image pixel is treated as one graph node and densely connected to the remaining graph nodes. Rather than utilizing the color-based affinity as in fully connected CRF, boundary neural fields (BNF) [16] first predicts object boundaries with FCN feature maps and then encodes the boundary information in the pairwise potential to enhance the semantic segmentation quality. However, given the massive number of pixels and the poor contrast in brain MR images, it is difficult to apply these methods directly to 3D brain image segmentation.
To address the above challenges, in this paper, we extract long-term dependencies in spatial-temporal information with convolutional LSTM [17, 18]. A novel randomized connection network is designed, which is a dynamic directed acyclic graph with a symmetric architecture. Through the randomized connection, the deep network behaves like an ensemble of multiple networks, which reduces the dependency between layers and increases the network capacity. To obtain comprehensive properties of the 3D brain image, both convolutional LSTM and 3D convolution are employed as network units to capture long-term and short-term spatial-temporal information independently. Their results are assembled and refined together with the proposed graph-based node selection and label inference. Experiments have been carried out on publicly available databases, and our method obtains quality segmentation results.
Note that a preliminary version of this work was presented at the 3rd Workshop on Deep Learning in Medical Image Analysis, in conjunction with MICCAI 2017. In this paper, 1) we extend our previous work by introducing the design of the randomized connection and the network units in detail; 2) additional mathematical equations and solutions, together with illustrative examples, are given; 3) intensive experiments have been carried out to evaluate each component of the proposed method, and comprehensive comparisons have been made with state-of-the-art methods.
II Methodology
In this section, we first introduce two kinds of network units, 3D convolution and convolutional LSTM, to capture short-term and long-term spatial-temporal information respectively. Then a novel symmetric network with randomized connection is presented as the architecture design. Graph-based node selection and label inference are further proposed to refine the labeling results efficiently.
II-A 3D Convolution
Convolutional Neural Network (CNN) is a widely used deep learning technique in computer vision tasks, such as image classification, object detection and semantic segmentation. There are two basic components in CNN: the convolution layer and the pooling layer, as shown in Fig. 1. To compute the pixel values in one layer, the pixels within the corresponding local region of the previous layer (namely the receptive field) are employed as input. For example, the convolutional response in layer can be estimated as follows:
(1)
where is the input from the receptive field (Red region in layer ), is the weight matrix and is the bias associated with the convolutional kernel. As for the nonlinear activation , it can be a traditional sigmoid or hyperbolic tangent function, or the Rectified Linear Unit (ReLU) [19]. As displayed in Fig. 1, one pair of and corresponds to one feature map, and each feature map has only one single image. The receptive field size for the convolution layer is , where is the number of feature maps in the previous layer and represents the 2D convolutional kernel size. Since the convolutional kernels operate on a local neighborhood rather than a single pixel, the spatial information can be captured and encoded in CNN. To obtain a more abstract feature representation, a pooling layer is usually placed after the convolution layer, and the pooling strategy can be maximum or average pooling. From Fig. 1, it can be noticed that the pooling operation only shrinks the size of the feature maps, while leaving their number unchanged.
To utilize CNN for 3D image analysis or video processing, the conventional way is first to apply a 2D CNN to each image slice or frame, and then to concatenate the results along the third image direction or the time axis. Directly applying 2D convolution to 3D volumes leads to the collapse of temporal information, since all frames (images) in the previous layer result in one image. To learn spatial-temporal features, 3D convolution has recently been introduced in video analysis tasks [13, 14]. The distinction between 2D and 3D convolution is illustrated in Fig. 2. With 2D convolution, the size of the receptive field is , where is the number of feature maps in the previous layer and is the number of images in each feature map. Using 3D convolution, the size of the receptive field becomes , where is along the third image direction (time axis) and . The 3D convolutional response in layer can be estimated in the following way:
(2) 
In this paper, we employ the ReLU as the nonlinear activation in 3D convolution.
For both 2D and 3D convolutions, one pair of and still corresponds to one feature map. After 2D convolution, however, each feature map has only one single image (as shown in Fig. 2(a), Layer ), which leads to the loss of temporal information. A feature map after 3D convolution still has multiple images and can keep track of the temporal property. However, due to the expensive computation, the value of is usually kept small in practice ( is often set to in 3D convolution, ), which is suitable for capturing short-term dependencies.
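The difference in output depth between the two cases can be illustrated with a minimal pure-Python sketch. It is not the implementation used in this work: the helper names (`conv3d_valid`, `ones`), the toy volume size and the all-ones kernels are assumptions made for illustration, and the 2D case is modeled here as a kernel that spans every slice of the input.

```python
def relu(x):
    """Rectified Linear Unit."""
    return x if x > 0.0 else 0.0


def conv3d_valid(volume, kernel, bias=0.0):
    """Valid (no padding) 3D cross-correlation of one volume with one kernel, then ReLU.

    volume and kernel are nested lists indexed as [depth][row][column].
    """
    D, H, W = len(volume), len(volume[0]), len(volume[0][0])
    d, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for z in range(D - d + 1):
        plane = []
        for y in range(H - kh + 1):
            row = []
            for x in range(W - kw + 1):
                s = bias
                for dz in range(d):
                    for dy in range(kh):
                        for dx in range(kw):
                            s += kernel[dz][dy][dx] * volume[z + dz][y + dy][x + dx]
                row.append(relu(s))
            plane.append(row)
        out.append(plane)
    return out


def ones(d, h, w):
    """Hypothetical helper: a d x h x w block filled with 1.0."""
    return [[[1.0] * w for _ in range(h)] for _ in range(d)]


volume = ones(8, 5, 5)  # toy volume: 8 slices of 5 x 5 pixels (assumed size)

# 2D-style case: the kernel covers all 8 slices, so the depth axis collapses to 1.
out_2d = conv3d_valid(volume, ones(8, 3, 3))
# 3D case: the kernel covers only 3 slices, so 6 positions remain along the depth axis.
out_3d = conv3d_valid(volume, ones(3, 3, 3))

print(len(out_2d), len(out_3d))
```

In this sketch the depth of the output is `D - d + 1`, which makes the trade-off visible: a small kernel depth keeps the third axis but only relates a few neighboring slices at a time.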
II-B Convolutional LSTM
Recurrent Neural Network (RNN) is another popular approach to collect temporal information, which is widely used in speech recognition and natural language processing. As displayed in Fig. 3, there is a loop inside RNN, which makes it inherently suitable for sequential modeling. To estimate the current hidden states , it depends on both the current input and the previous hidden states :
(3) 
If the output is required, it can be calculated as follows:
(4) 
, , and denote the input-to-hidden, hidden-to-hidden, and hidden-to-output weight matrices, and and are the corresponding biases. Analogous to CNN,
is the nonlinear activation function. Although the previous hidden state information is encoded in RNN, it is incapable of modeling long-term dependencies in long sequences, since the signal decreases exponentially over time steps [20]. Moreover, RNN suffers from the problem of vanishing or exploding gradients, which makes the optimization difficult.
To deal with the above problem, Long Short-Term Memory (LSTM) [21] was proposed, with a more complex neural network block that controls the information flow, as demonstrated in Fig. 4. The key component in LSTM is the memory cell state , which carries information through the entire chain with only minor linear operations. This memory cell can be accessed and updated by three gates: the forget gate , the input gate and the output gate . The forget gate decides how much information is thrown away from the past cell state , and the input gate determines the information to be accumulated into the latest cell state . As for the output gate , it controls the information propagation from the memory cell to the hidden state . Their detailed formulations are given as follows:
(5)
where and refer to the sigmoid and hyperbolic tangent functions respectively. The symbol stands for the Hadamard product, and , and denote the input weights, recurrent weights and biases respectively. Because of the memory cell and the gating mechanism, the error can be trapped inside the memory cell during backpropagation (also referred to as the constant error carousel [21]) over many time steps, so the gradient is prevented from vanishing or exploding quickly.
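The gating mechanism can be sketched with a scalar LSTM step in plain Python. The code follows the commonly used LSTM formulation (sigmoid gates, tanh candidate, elementwise products); the parameter layout, the gate names `"f"`, `"i"`, `"o"`, `"c"` and the numeric values are assumptions for illustration and are not taken from this work.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, params):
    """One step of a scalar LSTM.

    params maps a gate name to (input weight, recurrent weight, bias).
    """
    def pre_activation(gate):
        w, u, b = params[gate]
        return w * x + u * h_prev + b

    f = sigmoid(pre_activation("f"))    # forget gate
    i = sigmoid(pre_activation("i"))    # input gate
    o = sigmoid(pre_activation("o"))    # output gate
    g = math.tanh(pre_activation("c"))  # candidate cell content
    c = f * c_prev + i * g              # cell state: mostly linear update
    h = o * math.tanh(c)                # hidden state
    return h, c


# Assumed toy parameters: forget gate almost fully open, input gate almost closed.
params = {
    "f": (0.0, 0.0, 10.0),
    "i": (0.0, 0.0, -10.0),
    "o": (0.0, 0.0, 0.0),
    "c": (1.0, 1.0, 0.0),
}

h, c = 0.0, 1.0
for _ in range(100):
    h, c = lstm_step(0.5, h, c, params)

print(round(c, 3))
```

With these settings the cell state stays close to its initial value over 100 steps, which is the forward-pass counterpart of the error being kept inside the memory cell during backpropagation.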
In classic LSTM, fully connected transformations are employed during the input-to-state and state-to-state transitions. As such, the spatial property is ignored. To gather the spatial-temporal information, convolutional LSTM (ConvLSTM) was recently proposed [17, 18] to replace the fully connected transformation with the local convolution operation.
In this paper, we utilize ConvLSTM to collect the long-term dependencies for 3D images, where the third image axis is treated as the temporal dimension. The ConvLSTM for 3D image processing is illustrated in Fig. 5. To compute the pixel values in one layer, both the pixels within the corresponding local region of the previous layer (at the same time stamp) and those of the current layer (at the previous time stamp) are employed as input. For example, the ConvLSTM response in layer (Purple pixel) can be estimated as follows:
(6) 
where denotes the convolution operation, the symbol stands for the Hadamard product, and and refer to the sigmoid and hyperbolic tangent functions respectively. As shown in Fig. 5, is the input from the previous layer at the same time stamp (Red regions) and is the input from the current layer at the previous time stamp (Green regions). and denote the input-to-hidden and hidden-to-hidden weight matrices, with as the corresponding biases. Distinct from the weight matrices in classical LSTM, the input and recurrent weights in ConvLSTM are all 4D tensors, with a size of and respectively, where is the predefined number of convolution kernels (feature maps in the current layer), is the number of feature maps in the previous layer and is the convolutional kernel size. ConvLSTM can be regarded as a generalized version of classic LSTM, which corresponds to the case where the last two tensor dimensions are equal to .
II-C Randomized Connection Network
With 3D convolution and ConvLSTM settled as the network units to capture comprehensive spatial-temporal information, the next consideration is the design of the whole network architecture. Fully convolutional network (FCN) [11] is a classic deep learning network for image segmentation, which transforms the fully connected layers of a pretrained classification network into convolutional layers. To extract abstract features, pooling operations are indispensable in FCN, which leads to a significant size difference between the estimated probability map and the original input image. Extra upsampling or interpolation steps are necessary to make up the size difference, while the segmentation obtained through one direct upsampling can be unacceptably rough. To address this problem, the network architecture of FCN turns from a line topology into a directed acyclic graph (DAG), by adding links that append lower layers with higher resolution to the final prediction layer. U-Net [22] is another DAG with a symmetric contracting and expanding architecture, which has gained great success in biomedical image segmentation. 3D U-Net [23] was recently introduced for volumetric segmentation by replacing 2D convolution with 3D convolution.
Inspired by the improvements in biomedical image analysis using U-Net, in this paper, we also keep the symmetric contracting and expanding structure for 3D brain image segmentation, with the detailed network shown in Fig. 6. The 3D convolution/ConvLSTM (Black arrow) is employed to capture the short-term or long-term spatial-temporal properties. The Green arrows refer to the pooling or upsampling operations. Distinct from U-Net, where all connections are fixed (static DAG), in the proposed method the connection between the contracting and expansive paths (Red arrow) is randomly established during training (dynamic DAG). To further illustrate the concept, we use one layer as an example to analyze its input and output. For the th layer with randomized connection (Grey dashed square) along the expansive path, its output can be estimated as:
(7) 
where is the input from the previous layer along the expansive path, is the upsampling operation, and the input from corresponding layer along the contracting path. is a randomized function whose result is with the probability , and with the probability . During training, the input will be added to th layer with the probability in each iteration.
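The training-time behavior of one randomized connection can be sketched as follows. This is a simplified illustration on 1D lists rather than 3D feature maps; the function names, the nearest-neighbor upsampling and the toy values are assumptions, and the behavior at test time is not covered because it is not specified in the text above.

```python
import random


def upsample_nearest(x, factor=2):
    """Nearest-neighbor upsampling of a 1D signal (stand-in for the upsampling operation)."""
    return [v for v in x for _ in range(factor)]


def randomized_connection(expansive_in, contracting_in, p, rng):
    """Output of one expansive-path layer input during a single training iteration.

    The upsampled input from the expansive path is always used; the input from
    the contracting path is added only with probability p (a Bernoulli draw).
    """
    up = upsample_nearest(expansive_in)
    if rng.random() < p:
        return [u + c for u, c in zip(up, contracting_in)]
    return up


rng = random.Random(0)               # fixed seed so the sketch is reproducible
low_res = [1.0, 2.0]                 # assumed input from the previous expansive layer
skip = [10.0, 10.0, 10.0, 10.0]      # assumed input from the contracting path

outputs = [randomized_connection(low_res, skip, 0.5, rng) for _ in range(1000)]
kept = sum(1 for out in outputs if out[0] == 11.0)
print(kept / 1000)  # fraction of iterations in which the connection was established
```

Each call corresponds to one training iteration, so over many iterations the layer sees both the connected and the disconnected configuration.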
It is worth noting that randomized connection is different from dropout, although both of them try to enforce regularization on deep networks to decrease overfitting during training. Dropout intends to prevent the co-adaptation of neurons in neural networks, by randomly selecting a subset of units and setting their outputs to zero. In contrast, the proposed randomized connection intends to reduce the dependency between layers and to increase the model capacity. By randomly dropping the summation connection, the layers can be fully activated and forced to learn instead of relying on the previous ones.
Randomized connection improves robustness and efficiency because it reduces the dependency between layers and increases the model capacity. As discussed in [24], a residual network with identity skip-connections behaves like an ensemble of relatively shallow networks. In the proposed method, the summation connection is randomly established in every iteration, so a number of different models are assembled implicitly during training. If there are connections linking the two paths, then models are combined in the training process. In the proposed method, two randomized connection networks are trained independently, with ConvLSTM and 3D convolution as the network unit to capture long-term and short-term spatial-temporal information respectively.
II-D Graph-based Label Inference
Compared with general images, which are usually 2D and have relatively sharp object boundaries, medical volumes are much larger and the boundaries among tissues are quite blurry as a result of the poor contrast. Although fully connected CRF and BNF can boost the segmentation performance for general images, these differences in image properties might lead to problems if the methods are applied directly to 3D medical image segmentation. Given that the typical size for brain MR images [25] can be , with fully connected CRF, the number of connecting edges for one node becomes . On one hand, dense connections on a huge graph suffer from a heavy computation burden during optimization. On the other hand, due to the similar histogram profiles among different tissues in medical images, the dense connection can incur extra outliers or generate spatially disjoint predictions. For the boundary-based method BNF, its application is hindered by the poor contrast in brain images. As such, it is necessary to design an effective graph-based inference method for 3D brain image segmentation.
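The gap between dense and lattice connectivity can be made concrete with a few lines of arithmetic. The volume size used below is a hypothetical example chosen for illustration, not the size quoted from [25].

```python
def edges_per_node_fully_connected(shape):
    """In a fully connected graph every voxel is linked to all other voxels."""
    d, h, w = shape
    return d * h * w - 1


def lattice_edge_count(shape):
    """Total number of undirected edges in a 6-connected 3D lattice."""
    d, h, w = shape
    return (d - 1) * h * w + d * (h - 1) * w + d * h * (w - 1)


shape = (128, 128, 128)  # hypothetical volume size (assumption)

print(edges_per_node_fully_connected(shape))  # edges attached to a single node, dense graph
print(lattice_edge_count(shape))              # edges in the whole lattice graph
```

In a lattice each interior voxel has only 6 neighbors, so the whole lattice holds roughly three edges per voxel, whereas a single node of the dense graph already carries about as many edges as there are voxels.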
The proposed graph-based label inference method involves two steps: node selection and label inference. For the sake of efficiency, it is better to prune the majority of pixels and to focus on those whose results need to be refined. The node selection and label inference are introduced based on the fundamental graph , where the node set includes all pixels in the 3D image and the edge set corresponds to the image lattice connection. If and are adjacent in the 3D image, an edge will be set up, with as the edge weight. As both long-term and short-term spatial-temporal information are desirable in the node selection process, the labeling results estimated by ConvLSTM and those by 3D convolution need to be employed collaboratively. Note that the examples and figures in this subsection use two network results only for simplicity. In fact, node selection and label inference are not limited to a particular number of networks.
For each node , it can be represented as , where refers to the probability estimated by the th deep learning network that belongs to the foreground. During node selection, two criteria are taken into consideration: the label confidence of each node and the label consistency among the neighborhood. We want to filter out the nodes with high label confidence and consistency, so that we can focus on the remaining nodes for further processing. In Fig. 3, two small image cubes are extracted from two result images for illustration. For the node in the th result image (Yellow node), its confidence is evaluated by the contrast between the foreground and background probabilities, defined as follows:
(8) 
As for the consistency, it is measured by the cosine similarity between neighboring nodes, defined as:
(9) 
where , includes the 6-nearest neighbors in the th result image (Blue node) and the corresponding nodes in the rest of the images (Yellow node).
The two criteria are combined for node selection, and the detailed formulation is given as follows:
(10) 
where is the predefined threshold, indicating the percentage of nodes to be pruned, and is the set of confident nodes that can be pruned. The first unary term measures the label confidence and the second pairwise term assesses the label consistency. Equation (10) can be solved efficiently by sorting the energy for each in descending order and then setting the first nodes as confident nodes. The rest of the nodes are treated as candidate nodes and need further label inference.
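The sort-and-threshold procedure described above can be sketched in plain Python on a 1D chain of nodes. Several details are assumptions made for this sketch and may differ from Equations (8) to (10): the confidence is taken as the absolute difference between foreground and background probability, the cosine similarity is computed between the (foreground, background) probability vectors of two nodes, the two terms are summed with equal weight, and the neighborhood is the two chain neighbors plus the same node in the other result.

```python
import math


def confidence(p):
    """Contrast between foreground probability p and background probability 1 - p."""
    return abs(p - (1.0 - p))


def cosine(p, q):
    """Cosine similarity between the (foreground, background) vectors of two nodes."""
    a, b = (p, 1.0 - p), (q, 1.0 - q)
    dot = a[0] * b[0] + a[1] * b[1]
    return dot / (math.hypot(*a) * math.hypot(*b))


def node_energy(probs, i):
    """Assumed energy of node i; probs[k][i] is the foreground probability from network k."""
    n_networks, n_nodes = len(probs), len(probs[0])
    total = 0.0
    for k in range(n_networks):
        p = probs[k][i]
        # Neighbors in the same result (chain lattice) and the same node in other results.
        neigh = [probs[k][j] for j in (i - 1, i + 1) if 0 <= j < n_nodes]
        neigh += [probs[m][i] for m in range(n_networks) if m != k]
        consistency = sum(cosine(p, q) for q in neigh) / len(neigh)
        total += confidence(p) + consistency
    return total


def select_nodes(probs, tau):
    """Sort nodes by energy (descending) and mark the first tau fraction as confident."""
    n_nodes = len(probs[0])
    order = sorted(range(n_nodes), key=lambda i: node_energy(probs, i), reverse=True)
    confident = set(order[: int(tau * n_nodes)])
    candidates = [i for i in range(n_nodes) if i not in confident]
    return confident, candidates


# Toy probabilities from two networks for five nodes (assumed values).
probs = [
    [0.95, 0.90, 0.55, 0.10, 0.05],
    [0.90, 0.95, 0.30, 0.05, 0.10],
]
confident, candidates = select_nodes(probs, tau=0.6)
print(sorted(confident), candidates)
```

In this toy example the ambiguous middle node and its less consistent neighbor are the ones left as candidates for further label inference.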
The label inference is developed on a compact graph , where is the candidate node set and is the set of lattice edges connecting candidate nodes. The inference problem is formulated under the Random Walker framework [26], with the detailed definition given as follows:
(11) 
where is the probability that node belongs to the foreground, and and refer to the foreground and background seeds respectively, as shown in Fig. 3. In the first unary term, and are the priors from the deep learning networks, which are assigned with and . In the second pairwise term, is the edge weight for the lattice connection (Blue dashed line), which is estimated by a conventional Gaussian function:
(12) 
where is the intensity value and is a tuning parameter. By minimizing Equation (11), the probability for each candidate node can be obtained and the label can then be updated correspondingly: if and otherwise.
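A minimal sketch of this kind of inference is given below. It assumes a quadratic energy in which each candidate node is tied to the foreground seed with weight equal to its prior foreground probability and to the background seed with the complementary weight, plus Gaussian-weighted smoothness terms on the lattice edges; the energy is minimized with Gauss-Seidel iterations. The exact form of Equation (11), the single prior per node, the 0.5 decision threshold and all numeric values are assumptions for illustration.

```python
import math


def gaussian_weight(intensity_i, intensity_j, beta):
    """Edge weight from the intensity difference of two adjacent nodes."""
    return math.exp(-beta * (intensity_i - intensity_j) ** 2)


def infer_labels(priors, intensities, edges, beta, n_iter=200):
    """Minimize sum_i [p_i (x_i - 1)^2 + (1 - p_i) x_i^2] + sum_ij w_ij (x_i - x_j)^2.

    Setting the derivative with respect to x_i to zero gives the update
    x_i = (p_i + sum_j w_ij x_j) / (1 + sum_j w_ij), applied in place.
    """
    n = len(priors)
    neighbors = [[] for _ in range(n)]
    for i, j in edges:
        w = gaussian_weight(intensities[i], intensities[j], beta)
        neighbors[i].append((j, w))
        neighbors[j].append((i, w))
    x = list(priors)
    for _ in range(n_iter):
        for i in range(n):
            numerator = priors[i] + sum(w * x[j] for j, w in neighbors[i])
            denominator = 1.0 + sum(w for _, w in neighbors[i])
            x[i] = numerator / denominator
    labels = [1 if xi >= 0.5 else 0 for xi in x]  # assumed threshold
    return x, labels


# Toy chain of five candidate nodes (assumed values): node 1 has a weak prior
# but the same intensity profile as its foreground neighbors.
priors = [0.9, 0.4, 0.8, 0.2, 0.1]
intensities = [100.0, 101.0, 100.0, 50.0, 51.0]
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]

x, labels = infer_labels(priors, intensities, edges, beta=0.01)
print(labels)
```

Thresholding the priors alone would label node 1 as background; in this toy setting the strong edges to its similar-intensity neighbors pull it to the foreground, while the weak edge across the intensity jump keeps nodes 3 and 4 unaffected.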
III Experiments
In this paper, experiments have been carried out on two publicly available brain MR image databases – LPBA40 [25] and Ischemic Stroke Lesion Segmentation (ISLES) Challenge 2016 [27].
III-A Segmentation Results on LPBA40
LPBA40 has 40 volumes with 56 structures delineated. The database was randomly divided into two equal sets for training and testing respectively. Data augmentation with elastic transformation was performed to increase the amount of training data by 20 times, and the training process was set to 60 epochs, with a learning rate of . The rest of the parameter settings used in the experiments are listed as follows: the probability for randomized connection , the percentage of nodes to prune and the tuning parameter in the Gaussian function . Standard dropout regularization has also been utilized for the proposed networks and all the compared methods in the experiments.
Recently, several software packages have become available that provide automatic segmentation for brain MR images, such as BrainSuite [28] and FreeSurfer [29]. During evaluation, we utilized BrainSuite to segment images in the LPBA40 database as a reference. BrainSuite first runs surface/volume registration based on the extra high-resolution () BCI-DNI_brain atlas and then warps the label map from the atlas to the target image. In the experiments, FCN was employed as the baseline, where one patch-based classification network was first trained and then adapted to an image-based segmentation network by transforming the fully connected layers into convolutional layers. Besides the reference BrainSuite and the baseline FCN, we also compared with state-of-the-art methods: symmetric U-Net with fixed connection, using 2D and 3D convolution as network units.
Dice Coefficient (DC) is utilized to measure the quality of the segmentation results. The quantitative results on the available subcortical structures are given in Table I, with the highest values shown in Red. Each subcortical structure has two parts (located in the left and right hemispheres), and the results are provided for the left and right parts respectively, separated by a hyphen. The intermediate results generated by the randomized connection networks using 3D convolution and ConvLSTM are also provided in this table. Although BrainSuite utilizes a high-resolution atlas, the deep learning based methods (FCN, U-Net and our method), which rely on the low-resolution atlases inside the database, obtain much better performance.
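For reference, the Dice Coefficient between a predicted and a ground-truth binary label map is twice the size of their overlap divided by the sum of their sizes. A minimal sketch on flattened label lists follows; the function name and the convention of returning 1.0 when both maps are empty are assumptions.

```python
def dice_coefficient(pred, truth):
    """Dice Coefficient between two binary label sequences of equal length."""
    if len(pred) != len(truth):
        raise ValueError("label maps must have the same number of elements")
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    size_sum = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    if size_sum == 0:
        return 1.0  # both maps empty: treated as perfect agreement (assumed convention)
    return 2.0 * intersection / size_sum


pred = [1, 1, 0, 0]
truth = [1, 0, 0, 0]
print(dice_coefficient(pred, truth))
```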
In the experiments, the network architectures for 3D U-Net and 3D Convolution Random Net are kept the same, including the number of feature maps, the kernel size and the use of standard dropout; the only difference is the fixed connection versus the proposed randomized connection. The comparison of their quantitative results shows that the proposed randomized connection improves the segmentation significantly, by 1.26% on the LPBA40 database.
Compared with conventional FCN and U-Net, randomized connection networks obtain better results. Through graph-based label inference, the long-term and short-term information can be assembled to further improve the performance. Some visual results are shown in Fig. 4, with outliers circled in Red. The first column is the ground truth for reference, and the second and third columns are the results from the ConvLSTM and 3D convolution randomized connection networks. As displayed in the fourth column, the outliers produced by the randomized networks can be removed after graph-based label inference.
III-B Segmentation Results on ISLES Challenge 2016
The ISLES Challenge 2016 has two tasks: the segmentation of stroke lesion volumes and the regression of the clinical mRS score. There are 30 cases in the training dataset and 19 cases in the testing dataset. In the experiments, we focus on the segmentation of ischemic stroke lesion regions. Data augmentation with elastic transformation was performed to increase the amount of training data by 20 times, and the remaining parameter settings were kept the same as those for the LPBA40 database.
Compared with the labeling of subcortical structures, the segmentation of ischemic stroke lesions is more challenging, as the shape and position of pathological regions are not predictable. Many methods cannot successfully label all 19 cases in the test dataset. In Table II, we list the teams that have finished the labeling of all 19 cases, measured with DC; our team ranks 3rd in this list. (Here we only consider the results that cover all 19 cases, while the list on the challenge website also includes Dice results from submissions that did not finish all 19 cases.) Compared with the patch-wise segmentation method employed by the 1st team, the performance of our proposed method is competitive, and our image-wise segmentation is more efficient, as it provides the labeling map for the 3D image directly.
III-C Further Discussion
To evaluate the proposed graph-based method, a small-scale experiment was performed in which fully connected CRF was used to postprocess the deep learning outcomes of 3D U-Net on the LPBA40 database. We employed the publicly available implementation of fully connected CRF for processing 3D images [15, 30]. Some visual results and the quality improvements brought by fully connected CRF, measured with DC, are given in Fig. 9 and Fig. 10, respectively. In Fig. 9, the first and fourth columns are the ground truth and our graph-based label inference result for reference. The second column displays the labeling results of 3D U-Net, and the third column shows that some outliers still exist even after postprocessing with fully connected CRF. This might be caused by the poor contrast and similar histograms among tissues in brain MR images (discussed in detail in Section II-D). From Fig. 10, we can observe that the improvement brought by fully connected CRF after 3D U-Net is limited, only 0.08% measured by DC. As for the running time to label one subcortical structure in one target image, it takes around 1 minute using fully connected CRF (on a machine with a 3.2 GHz quad-core CPU and 8 GB RAM), compared with 4 seconds using our graph-based label inference.
To test the performance of the proposed randomized connection, the comparisons between the randomized connection networks and the corresponding symmetric U-Net with fixed connection are shown in Fig. 11. In the upper figure, the comparison between 3D U-Net and 3D Convolution Random Net indicates that the randomized connection improves the labeling quality by 1.26%. In the bottom figure, the comparison between ConvLSTM U-Net and ConvLSTM Random Net indicates that the randomized connection improves the labeling quality significantly, by 2.03%.
IV Conclusion
In this paper, a novel deep network with randomized connection is proposed for 3D brain image segmentation, with ConvLSTM and 3D convolution as network units to capture long-term and short-term spatial-temporal information respectively. The proposed randomized connection enforces regularization on the deep networks to decrease overfitting during training, by controlling the connections among layers. To determine the label for each pixel efficiently, graph-based node selection is introduced to prune the majority of confident nodes and to focus on the nodes that really need further label inference. The long-term and short-term dependencies are encoded into the graph as priors and utilized collaboratively in the graph-based inference. Experiments carried out on publicly available databases indicate that our method obtains competitive performance compared with other state-of-the-art methods.
References
[1] http://www.alz.org/facts/overview.asp.
[2] https://www.cdc.gov/stroke/facts.htm.
[3] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[4] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke et al., “Going deeper with convolutions,” CoRR, vol. abs/1409.4842, 2014.
[5] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
[6] C. Lu and X. Tang, “Surpassing human-level face verification performance on lfw with gaussianface,” arXiv preprint arXiv:1404.3840, 2014.
[7] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on imagenet classification,” arXiv preprint arXiv:1502.01852, 2015.
[8] C. Farabet, C. Couprie, L. Najman, and Y. LeCun, “Learning hierarchical features for scene labeling,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 8, pp. 1915–1929, 2013.
[9] P. H. Pinheiro and R. Collobert, “Recurrent convolutional neural networks for scene parsing,” CoRR, vol. abs/1306.2795, 2013.
[10] P. Sermanet, D. Eigen, X. Zhang et al., “Overfeat: Integrated recognition, localization and detection using convolutional networks,” CoRR, vol. abs/1312.6229, 2013.
[11] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in IEEE CVPR, 2015, pp. 3431–3440.
[12] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “Semantic image segmentation with deep convolutional nets and fully connected crfs,” arXiv preprint arXiv:1412.7062, 2014.
[13] S. Ji, W. Xu, M. Yang, and K. Yu, “3d convolutional neural networks for human action recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 1, pp. 221–231, 2013.
[14] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, “Learning spatiotemporal features with 3d convolutional networks,” in IEEE International Conference on Computer Vision. IEEE, 2015, pp. 4489–4497.
[15] V. Koltun, “Efficient inference in fully connected crfs with gaussian edge potentials,” Advances in Neural Information Processing Systems, 2011.
[16] G. Bertasius, J. Shi, and L. Torresani, “Semantic segmentation with boundary neural fields,” in Computer Vision and Pattern Recognition. IEEE, 2016, pp. 3602–3610.
[17] S. Xingjian, Z. Chen, H. Wang, D.-Y. Yeung, W.-k. Wong, and W.-c. Woo, “Convolutional lstm network: A machine learning approach for precipitation nowcasting,” in Advances in Neural Information Processing Systems, 2015, pp. 802–810.
[18] V. Patraucean, A. Handa, and R. Cipolla, “Spatio-temporal video autoencoder with differentiable memory,” arXiv preprint arXiv:1511.06309, 2015.
[19] V. Nair and G. E. Hinton, “Rectified linear units improve restricted boltzmann machines,” in International Conference on Machine Learning, 2010, pp. 807–814.
[20] Y. Bengio, P. Simard, and P. Frasconi, “Learning long-term dependencies with gradient descent is difficult,” IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 157–166, 1994.
[21] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[22] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2015, pp. 234–241.
[23] Ö. Çiçek, A. Abdulkadir, S. S. Lienkamp, T. Brox, and O. Ronneberger, “3d u-net: learning dense volumetric segmentation from sparse annotation,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2016, pp. 424–432.
[24] A. Veit, M. J. Wilber, and S. Belongie, “Residual networks behave like ensembles of relatively shallow networks,” in Advances in Neural Information Processing Systems, 2016, pp. 550–558.
[25] D. W. Shattuck, M. Mirza, V. Adisetiyo, C. Hojatkashani, G. Salamon, K. L. Narr, R. A. Poldrack, R. M. Bilder, and A. W. Toga, “Construction of a 3d probabilistic atlas of human cortical structures,” NeuroImage, vol. 39, no. 3, pp. 1064–1080, 2008.
[26] L. Grady, “Random walks for image segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 11, pp. 1768–1783, 2006.
[27] O. Maier, B. H. Menze, J. von der Gablentz, L. Häni, M. P. Heinrich, M. Liebrand, S. Winzeck, A. Basit, P. Bentley, L. Chen et al., “Isles 2015 - a public evaluation benchmark for ischemic stroke lesion segmentation from multispectral mri,” Medical Image Analysis, vol. 35, pp. 250–269, 2017.
[28] D. W. Shattuck and R. M. Leahy, “Brainsuite: an automated cortical surface identification tool,” Medical Image Analysis, vol. 6, no. 2, pp. 129–142, 2002.
[29] B. Fischl, “Freesurfer,” NeuroImage, vol. 62, no. 2, pp. 774–781, 2012.
[30] https://github.com/Kamnitsask/dense3dCrf.