1 Introduction
Live cell microscopy imaging is a powerful tool and an important part of the biological research process. Automatic annotation of the acquired image sequences is crucial for the quantitative analysis of properties such as cell size, mobility, and protein levels. Recent image analysis approaches have demonstrated the strength of Convolutional Neural Networks (CNNs), which surpass state-of-the-art methods in virtually all fields, including object classification [1], detection [2], semantic segmentation [3], and many other tasks. Attempts at cell segmentation using CNNs include [4, 5, 6]. All these methods, however, are trained on independent, non-sequential frames and do not incorporate any temporal information, which can potentially facilitate segmentation in cases of neighboring cells that are hard to separate or when a cell partially vanishes. The use of temporal information, by combining tracking information from individual cells to support segmentation decisions, has been shown to improve results for non-deep-learning methods [7, 8, 9, 10], but has not yet been extensively examined in deep learning approaches.
A Recurrent Neural Network (RNN) is an artificial neural network equipped with feedback connections, a unique architecture that makes it suitable for the analysis of dynamic behavior. A notable variant of the RNN is the Long Short-Term Memory (LSTM) network, which includes an internal memory state vector with gating operations, thus stabilizing the training process [11]. Common LSTM-based applications include natural language processing (NLP) [12], audio processing [13], and image captioning [14].
Convolutional LSTMs (CLSTMs) accommodate locally spatial information in image sequences by replacing matrix multiplications with convolutions [15]. The CLSTM has recently been used to address the analysis of both temporal image sequences, such as next-frame prediction [16], and volumetric data sets [17, 18]. In [18] the CLSTM is applied in multiple directions for the segmentation of 3D data represented as a stack of 2D slices. Another approach for 3D brain structure segmentation is proposed in [17], where each slice is separately fed into a U-Net architecture, and only the output is then fed into bidirectional CLSTMs.
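To make the CLSTM update concrete, the following is a minimal sketch of a ConvLSTM cell (our illustration, in PyTorch; [15] is the reference formulation) in which all four gate pre-activations are computed by a single convolution over the concatenated input and previous hidden state:

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """One CLSTM cell: the LSTM gate pre-activations are computed by a
    convolution over [input, previous hidden state] instead of a matrix
    multiplication, preserving spatial structure."""

    def __init__(self, in_ch, hidden_ch, kernel_size=3):
        super().__init__()
        self.hidden_ch = hidden_ch
        # a single conv yields all four gates: input, forget, output, candidate
        self.gates = nn.Conv2d(in_ch + hidden_ch, 4 * hidden_ch,
                               kernel_size, padding=kernel_size // 2)

    def forward(self, x, state):
        h_prev, c_prev = state
        i, f, o, g = torch.split(
            self.gates(torch.cat([x, h_prev], dim=1)), self.hidden_ch, dim=1)
        c = torch.sigmoid(f) * c_prev + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c  # new hidden state and memory state
```

A full CLSTM layer applies this cell to every frame in turn, carrying (h, c) across time steps.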
In this paper we introduce the integration of CLSTMs into an encoder-decoder structure (U-Net), allowing compact spatiotemporal representations at multiple scales. We note that, unlike [17], which was designed and evaluated on 3D brain segmentation, the proposed novel architecture is an intertwined composition of the two concepts rather than a pipeline. Furthermore, since our method is designed for image sequences which can be very long, a bidirectional CLSTM is not computationally feasible. Our framework is assessed using time-lapse microscopy data where both the cells' dynamics and their spatial properties should be considered. Specifically, we tested our method on the Cell Tracking Challenge: http://www.celltrackingchallenge.net. Our method was ranked in the top three by the challenge organizers on several of the submitted data sets, specifically on the fluorescent simulated data set (Fluo-N2DH-SIM+) and the differential interference contrast (DIC-C2DL-HeLa) sequences, which are difficult to segment.
The rest of the paper is organized as follows. Section 2 presents a probabilistic formulation of the problem and elaborates on the proposed network. Technical aspects are detailed in Section 3. In Section 4 we demonstrate the strength of our method, presenting state-of-the-art cell segmentation results. We conclude in Section 5.
2 Methods
2.1 Network Architecture
The proposed network incorporates CLSTM [15] blocks into the U-Net [6] architecture. This combination, as suggested here, is shown to be powerful. The U-Net architecture, built as an encoder-decoder with skip connections, enables the extraction of meaningful descriptors at multiple image scales. However, this alone does not account for the cell-specific dynamics that can significantly support the segmentation. The introduction of CLSTM blocks into the network allows past cell appearances to be considered at multiple scales, by holding their compact representations in the CLSTM memory units. We propose here the incorporation of CLSTM layers at every scale of the encoder section of the U-Net. Applying the CLSTM at multiple scales is essential for cell microscopy sequences (as opposed to brain slices as in [17]) since the frame-to-frame differences may occur at different scales, depending on the cells' dynamics. Moreover, in contrast to brain volume segmentation [17], a microscopy sequence can be of arbitrary length, making the use of bidirectional LSTMs computationally impractical; in addition, the cells can move at different speeds and the changes are not necessarily smooth. A comparison to other alternatives is presented in Section 4.2. The network is fully convolutional and, therefore, can be used with any image size¹ during both training and testing. Figure 1 illustrates the network architecture detailed in Section 3.

¹ In order to avoid artifacts it is preferable to use image sizes which are multiples of eight, due to the three max-pooling layers.

2.2 Formulation
We address the segmentation of individual cells in microscopy sequences. The main challenge in this type of problem is not only foreground-background classification but also the separation of adjacent cells. We adopt the weighted distance loss as suggested by [6]. The loss is designed to enhance the delineation of individual cells by partitioning the $d$-dimensional ($d = 2, 3$) image domain into two classes, foreground and background, such that pixels which are near the boundaries of two adjacent cells are given higher importance. We set $\mathcal{L} = \{\mathrm{fg}, \mathrm{bg}\}$ to denote these classes, respectively. Let $\{I_t\}_{t=0}^{T}$ be the input image sequence of length $T+1$, where each $I_t$ is a grayscale image. The network is composed of two sections of $\ell$ blocks each: the encoder recurrent blocks $\mathcal{E}^i_\Theta$ and the decoder blocks $\mathcal{D}^i_\Theta$, $i = 1, \ldots, \ell$, where $\Theta$ denotes the network's parameters. The input to the CLSTM encoder layer $i$ at time $t$ includes the downsampled output of the previous layer, the output of the current layer at the previous time step, and the CLSTM memory cell. We denote these three inputs as $x^i_t$, $h^i_{t-1}$, $c^i_{t-1}$, respectively. Formally, we define:

$h^i_t, c^i_t = \mathcal{E}^i_\Theta\left(x^i_t, h^i_{t-1}, c^i_{t-1}\right)$   (1)

where,

$x^i_t = \begin{cases} I_t, & i = 1 \\ \mathrm{maxpool}\left(h^{i-1}_t\right), & i > 1 \end{cases}$   (2)
The inputs to the decoder layers are the upsampled² output of the previous layer and the output of the corresponding layer from the encoder, denoted by $\hat{z}^i_t$ and $h^{\ell-i+1}_t$, respectively. We denote the decoder output as $z^i_t$. Formally,

$z^i_t = \mathcal{D}^i_\Theta\left(\hat{z}^i_t, h^{\ell-i+1}_t\right)$   (3)

$\hat{z}^i_t = \begin{cases} h^\ell_t, & i = 1 \\ \mathrm{upsample}\left(z^{i-1}_t\right), & i > 1 \end{cases}$   (4)

² We use bilinear interpolation.
We define the network $f_\Theta$ as the composition of the $\ell$ encoder blocks followed by the $\ell$ decoder blocks. Note that the encoder blocks, $\mathcal{E}^i_\Theta$, encode high-level spatiotemporal features at multiple scales, and the decoder blocks, $\mathcal{D}^i_\Theta$, refine that information into a full-scale segmentation map:

$O_t = z^\ell_t = f_\Theta\left(I_t;\, \{h^i_{t-1}, c^i_{t-1}\}_{i=1}^{\ell}\right)$   (5)
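As a schematic illustration of the composition above, the following sketch wires plain convolutions (standing in for the CLSTM encoder blocks of Eq. (1)) into the encoder-decoder structure, with max-pooling between encoder scales and bilinear upsampling plus skip concatenation in the decoder; the depths, block count, and exact skip pairing are illustrative choices of ours, not the paper's:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ULSTMSketch(nn.Module):
    """Schematic encoder-decoder wiring: per-scale encoder blocks (plain
    convs stand in here for the CLSTM blocks), max-pooling between
    scales, and decoder blocks that concatenate bilinearly upsampled
    features with the encoder output at the matching scale."""

    def __init__(self, depths=(8, 16, 32)):
        super().__init__()
        d1, d2, d3 = depths
        self.e1 = nn.Conv2d(1, d1, 3, padding=1)        # encoder, full scale
        self.e2 = nn.Conv2d(d1, d2, 3, padding=1)       # encoder, 1/2 scale
        self.e3 = nn.Conv2d(d2, d3, 3, padding=1)       # encoder, 1/4 scale
        self.d1 = nn.Conv2d(d3 + d2, d2, 3, padding=1)  # decoder, 1/2 scale
        self.d2 = nn.Conv2d(d2 + d1, d1, 3, padding=1)  # decoder, full scale
        self.out = nn.Conv2d(d1, 2, 1)                  # per-pixel fg/bg scores

    def forward(self, img):
        up = lambda z: F.interpolate(z, scale_factor=2, mode='bilinear',
                                     align_corners=False)
        h1 = F.leaky_relu(self.e1(img))
        h2 = F.leaky_relu(self.e2(F.max_pool2d(h1, 2)))
        h3 = F.leaky_relu(self.e3(F.max_pool2d(h2, 2)))
        z1 = F.leaky_relu(self.d1(torch.cat([up(h3), h2], dim=1)))
        z2 = F.leaky_relu(self.d2(torch.cat([up(z1), h1], dim=1)))
        return self.out(z2)  # the per-pixel output features O_t
```

Because only pooling and upsampling change resolution, the output map has the same spatial size as the input frame.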
We set the final output $O_t$ to be an $|\mathcal{L}|$-dimensional feature vector corresponding to each input pixel $v$. We define the segmentation as the pixel label probabilities using the softmax equation:

$p_t(v = l) = \frac{\exp\left(O^l_t(v)\right)}{\sum_{l' \in \mathcal{L}} \exp\left(O^{l'}_t(v)\right)}, \qquad l \in \mathcal{L}$   (6)

The final segmentation is defined as follows:

$\Gamma_t(v) = \arg\max_{l \in \mathcal{L}} \; p_t(v = l)$   (7)
Each connected component of the foreground class is given a unique label and is considered an individual cell.
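A minimal sketch of this inference step, from per-pixel scores to labeled cell instances (function and variable names are ours, for illustration):

```python
import numpy as np
from scipy import ndimage

def segment_from_logits(logits):
    """Map per-pixel class scores O_t (shape: 2 x H x W, channel 1 =
    foreground) to labeled cell instances: a stable softmax over the
    class axis, a per-pixel argmax, then connected-component labeling
    of the resulting foreground mask."""
    e = np.exp(logits - logits.max(axis=0, keepdims=True))  # stable softmax
    probs = e / e.sum(axis=0, keepdims=True)
    foreground = probs.argmax(axis=0) == 1
    # each connected foreground component is considered an individual cell
    labels, n_cells = ndimage.label(foreground)
    return labels, n_cells
```

Note that the separation of adjacent cells hinges on the foreground mask actually containing a background gap between them, which is what the weighted loss below encourages.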
2.3 Training and Loss
During the training phase the network is presented with a full sequence $\{I_t\}_{t=0}^{T}$ and manual annotations $\{\Gamma^{\mathrm{GT}}_t\}_{t=0}^{T}$, where $\Gamma^{\mathrm{GT}}_t$ are the ground truth (GT) labels. The network is trained using Truncated Back Propagation Through Time (TBPTT) [19]. At each back-propagation step the network is unrolled for a fixed number of time steps. The loss is defined using the distance-weighted cross-entropy loss proposed in the original U-Net paper [6]. The loss imposes separation of cells by introducing an exponential penalty factor which depends on the distance of a pixel from the pixels of its nearest and second-nearest cells. Consequently, pixels which are located between two adjacent cells are given significant importance, whereas pixels further away from the cells have a minor effect on the loss. A detailed discussion of the weighted loss can be found in the original U-Net paper [6].
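The distance-dependent weight term can be sketched as follows. This is a simplified version in the spirit of the U-Net weight map [6]: the constants w0 and sigma are illustrative values of ours, and the class-balancing term of [6] is omitted.

```python
import numpy as np
from scipy import ndimage

def distance_weight_map(instance_labels, w0=10.0, sigma=5.0):
    """Simplified distance-based pixel weights:
    w(v) = 1 + w0 * exp(-(d1(v) + d2(v))^2 / (2 * sigma^2)),
    where d1, d2 are the distances to the nearest and second-nearest
    cells, so pixels squeezed between two adjacent cells dominate."""
    weights = np.ones(instance_labels.shape, dtype=float)
    ids = [i for i in np.unique(instance_labels) if i != 0]
    if len(ids) < 2:
        return weights  # fewer than two cells: nothing to separate
    # per-cell Euclidean distance transforms (distance to each cell)
    dists = np.stack([ndimage.distance_transform_edt(instance_labels != i)
                      for i in ids])
    dists.sort(axis=0)
    d1, d2 = dists[0], dists[1]  # nearest and second-nearest cell
    weights += w0 * np.exp(-((d1 + d2) ** 2) / (2.0 * sigma ** 2))
    return weights
```

The resulting map multiplies the per-pixel cross-entropy, so the narrow background ridge between touching cells receives the largest gradients.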
3 Implementation Details
3.1 Architecture
The network comprises $\ell$ encoder and $\ell$ decoder blocks. Each block in the encoder section is composed of a CLSTM layer, a leaky ReLU, a convolutional layer, batch normalization [20], and a leaky ReLU, and is finally downsampled using a max-pooling operation. Each decoder block consists of a bilinear interpolation, a concatenation with the parallel encoder block, followed by two convolutional layers, batch normalization [20], and a leaky ReLU. All max-pooling layers use non-overlapping kernels. The last convolutional layer has depth $|\mathcal{L}|$ and is followed by a softmax layer to produce the final class probabilities; the kernel sizes and layer depths are detailed in Figure 1.

3.2 Training Regime
3.3 Data
The images were annotated using two labels, for the background and the cell nucleus. In order to increase the variability, the data were randomly augmented spatially and temporally by: 1) random horizontal and vertical flips; 2) random rotations; 3) random crops; 4) random sequence reversal; 5) random temporal downsampling; 6) random affine and elastic transformations. We note that the grayscale values are not augmented, as they are biologically meaningful.
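A sketch of a subset of these augmentations (flips, temporal reversal, and temporal downsampling) for a sequence of shape T × H × W; the probabilities and the downsampling range are our assumptions, and, as noted above, gray values are left untouched:

```python
import numpy as np

def augment_sequence(frames, rng):
    """Random spatio-temporal augmentation of a (T, H, W) sequence:
    horizontal/vertical flips, temporal reversal, and temporal
    downsampling. Gray values are deliberately left unchanged."""
    seq = frames
    if rng.random() < 0.5:
        seq = seq[:, :, ::-1]   # horizontal flip
    if rng.random() < 0.5:
        seq = seq[:, ::-1, :]   # vertical flip
    if rng.random() < 0.5:
        seq = seq[::-1]         # temporal reversal
    step = int(rng.integers(1, 3))  # temporal downsampling factor: 1 or 2
    return np.ascontiguousarray(seq[::step])
```

In practice the same spatial transform must also be applied to the GT label maps so that images and annotations stay aligned.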
4 Experiments and Results
Architecture   EncLSTM        DecLSTM        FullLSTM
SEG            0.874 (0.011)  0.729 (0.166)  0.798 (0.094)
Table 1: Comparison of three variants of the proposed network, incorporating CLSTMs in: (1) the encoder section (EncLSTM); (2) the decoder section (DecLSTM); (3) both the encoder and decoder sections (FullLSTM). The training procedure was repeated three times; means and standard deviations are reported.
Dataset           EncLSTM (BGU-IL)   DecLSTM (BGU-IL)   First   Second   Third
Fluo-N2DH-SIM+    0.811 (1st)        0.802 (3rd)        0.811   0.807    0.802
DIC-C2DL-HeLa     0.793 (2nd)        0.511 (5th)        0.814   0.793    0.792
PhC-C2DH-U373     0.842 (5th)        –                  0.924   0.922    0.920
Fluo-N2DH-GOWT1   0.850 (8th)        0.854 (7th)        0.927   0.894    0.893
Fluo-N2DL-HeLa    0.811 (8th)        0.839 (6th)        0.903   0.902    0.900

Table 2: Cell Tracking Challenge results (SEG measure): the two variants of our method compared to the three leading methods for each data set.
4.1 Evaluation Method
The method was evaluated using the scheme proposed in the online version of the Cell Tracking Challenge, specifically the SEG measure for segmentation [22]. The SEG measure is defined as the mean Jaccard index $J(A, B) = \frac{|A \cap B|}{|A \cup B|}$ over pairs of a ground truth label $A$ and its corresponding segmentation $B$. A segmentation $B$ is considered a match for $A$ if it covers more than half of $A$'s pixels, i.e., $|A \cap B| > 0.5\,|A|$.

4.2 Architecture Selection
We propose integrating the CLSTM into the U-Net by substituting the convolutional layers of the encoder section with CLSTM layers (referred to as EncLSTM). In this section we compare this architecture with two alternatives, obtained by substituting: 1) the convolutional layers of the decoder section (referred to as DecLSTM); 2) the convolutional layers of both the encoder and decoder sections (referred to as FullLSTM). All three networks were trained simultaneously with identical inputs. Due to the limited size of the training set, the networks were trained on the Fluo-N2DH-SIM+ data set and tested on the similar Fluo-N2DH-GOWT1 data set from the training set of the Cell Tracking Challenge. The results, presented in Table 1, show an advantage for the proposed architecture. However, the dominance of the EncLSTM with respect to the DecLSTM is not conclusive, as demonstrated by the results for the Cell Tracking Challenge discussed next. We further note that a comparison to the original U-Net, without LSTM, is included in the challenge results; that method is referred to by the challenge organizers as FR-Ro-GE.
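For reference, the SEG measure used in Tables 1 and 2 can be computed as follows (our sketch of the definition in [22]; both label images are assumed to use 0 for background):

```python
import numpy as np

def seg_score(gt, pred):
    """SEG: for each GT object, find the predicted object covering more
    than half of its pixels; score matched pairs by the Jaccard index
    and unmatched GT objects by 0; return the mean over GT objects."""
    scores = []
    for g in np.unique(gt):
        if g == 0:
            continue
        gmask = gt == g
        # predicted label with the largest overlap inside this GT object
        cand, counts = np.unique(pred[gmask], return_counts=True)
        best = cand[counts.argmax()]
        if best == 0 or counts.max() <= 0.5 * gmask.sum():
            scores.append(0.0)  # no match: majority is background or <= 50%
            continue
        pmask = pred == best
        scores.append(np.logical_and(gmask, pmask).sum()
                      / np.logical_or(gmask, pmask).sum())
    return float(np.mean(scores))
```

Since at most one predicted object can cover more than half of a GT object, the match is unambiguous.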
4.3 Cell Tracking Challenge Results
Two variants of the method, EncLSTM and DecLSTM, were applied to five data sets: Fluo-N2DH-SIM+, DIC-C2DL-HeLa, PhC-C2DH-U373, Fluo-N2DH-GOWT1, and Fluo-N2DL-HeLa. The results were submitted to the Cell Tracking Challenge. The proposed EncLSTM and DecLSTM were ranked 1st and 3rd, respectively, out of 20, for the Fluo-N2DH-SIM+ data set, and 2nd out of 10 (EncLSTM) for the DIC-C2DL-HeLa data set. We note that for the other three data sets the training data were significantly smaller, which might explain the inferior results we received. A possible solution to this problem is using an adversarial loss, as suggested in [4]. In general, 20 different methods were submitted to the challenge, including the original U-Net (FR-Ro-GE) [6] and TUG-AT [23]. The latter independently and simultaneously proposed to utilize the U-Net architecture while introducing CLSTM layers on the skip connections. Table 2 reports our results in comparison to the three leading methods (including ours) as provided by the challenge organizers. Visualizations of the results are presented in Figure 2 and at https://youtu.be/IHULAZBmoIM. The quantitative results for the top three leading methods are also publicly available on the Cell Tracking Challenge website.
5 Summary
Time-lapse microscopy cell segmentation is, inherently, a spatiotemporal task. Human annotators frequently rely on temporal cues in order to accurately separate neighboring cells and detect partially visible cells. In this work, we demonstrated the strength of integrating temporal analysis, in the form of CLSTMs, into a well-established network architecture (U-Net) and examined several alternative combinations. The resulting novel architecture is able to extract meaningful features at multiple scales and propagate them through time. This enables the network to accurately segment cells in difficult scenarios where temporal cues are crucial. Quantitative analysis shows that our method achieves state-of-the-art results (Table 2), ranking in 1st and 2nd place in the Cell Tracking Challenge³. Moreover, the results reported in Table 1 demonstrate the proposed network's ability to generalize from simulated training data to real data. This may imply that one can reduce, and even eliminate, the need for extensive manually annotated data. We further plan to incorporate an adversarial loss to weaken the dependency on the training set size, as in [4].

³ Based on the October 15th, 2018 ranking.
References
[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in NIPS, 2012, pp. 1097–1105.
[2] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," in Proceedings of CVPR, 2016, pp. 779–788.
[3] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," in Proceedings of the IEEE CVPR, 2015, pp. 3431–3440.
[4] A. Arbelle and T. Riklin Raviv, "Microscopy cell segmentation via adversarial neural networks," arXiv preprint arXiv:1709.05860, 2017.
[5] O. Z. Kraus, J. L. Ba, and B. J. Frey, "Classifying and segmenting microscopy images with deep multiple instance learning," Bioinformatics, vol. 32, no. 12, pp. i52–i59, 2016.
[6] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," arXiv preprint arXiv:1505.04597, 2015.
[7] F. Amat, W. Lemon, D. P. Mossing, K. McDole, Y. Wan, K. Branson, E. W. Myers, and P. J. Keller, "Fast, accurate reconstruction of cell lineages from large-scale fluorescence microscopy data," Nature Methods, 2014.
[8] A. Arbelle, N. Drayman, M. Bray, U. Alon, A. Carpenter, and T. Riklin Raviv, "Analysis of high-throughput microscopy videos: Catching up with cell dynamics," in MICCAI 2015, pp. 218–225. Springer, 2015.
[9] M. Schiegg, P. Hanslovsky, C. Haubold, U. Koethe, L. Hufnagel, and F. A. Hamprecht, "Graphical model for joint segmentation and tracking of multiple dividing cells," Bioinformatics, vol. 31, no. 6, pp. 948–956, 2014.
[10] A. Arbelle, J. Reyes, J.-Y. Chen, G. Lahav, and T. Riklin Raviv, "A probabilistic approach to joint cell tracking and segmentation in high-throughput microscopy videos," Medical Image Analysis, vol. 47, pp. 140–152, 2018.
[11] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[12] O. Vinyals, Ł. Kaiser, T. Koo, S. Petrov, I. Sutskever, and G. Hinton, "Grammar as a foreign language," in Advances in Neural Information Processing Systems, 2015, pp. 2773–2781.
[13] H. Sak, A. Senior, and F. Beaufays, "Long short-term memory recurrent neural network architectures for large scale acoustic modeling," in Fifteenth Annual Conference of the International Speech Communication Association, 2014.
[14] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio, "Show, attend and tell: Neural image caption generation with visual attention," in International Conference on Machine Learning, 2015, pp. 2048–2057.
[15] S. Xingjian, Z. Chen, H. Wang, D.-Y. Yeung, W.-K. Wong, and W.-C. Woo, "Convolutional LSTM network: A machine learning approach for precipitation nowcasting," in Advances in Neural Information Processing Systems, 2015, pp. 802–810.
[16] W. Lotter, G. Kreiman, and D. Cox, "Deep predictive coding networks for video prediction and unsupervised learning," arXiv preprint arXiv:1605.08104, 2016.
[17] J. Chen, L. Yang, Y. Zhang, M. Alber, and D. Z. Chen, "Combining fully convolutional and recurrent neural networks for 3D biomedical image segmentation," in Advances in Neural Information Processing Systems, 2016, pp. 3036–3044.
[18] M. F. Stollenga, W. Byeon, M. Liwicki, and J. Schmidhuber, "Parallel multi-dimensional LSTM, with application to fast biomedical volumetric image segmentation," in Advances in Neural Information Processing Systems, 2015, pp. 2998–3006.
[19] R. J. Williams and J. Peng, "An efficient gradient-based algorithm for on-line training of recurrent network trajectories," Neural Computation, vol. 2, no. 4, pp. 490–501, 1990.
[20] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," in International Conference on Machine Learning, 2015, pp. 448–456.
[21] G. Hinton, "RMSProp," Coursera lecture slides, Lecture 6.
[22] V. Ulman et al., "An objective comparison of cell-tracking algorithms," Nature Methods, vol. 14, no. 12, p. 1141, 2017.
[23] C. Payer, D. Štern, T. Neff, H. Bischof, and M. Urschler, "Instance segmentation and tracking with cosine embeddings and recurrent hourglass networks," arXiv preprint arXiv:1806.02070, 2018.