Live cell microscopy imaging is a powerful tool and an important part of the biological research process. The automatic annotation of the image sequences is crucial for the quantitative analysis of properties such as cell size, mobility, and protein levels. Recent image analysis approaches have shown the strengths of Convolutional Neural Networks (CNNs), which surpass previous state-of-the-art methods in virtually all fields, such as object classification [1], detection [2], semantic segmentation [3], and many other tasks. Attempts at cell segmentation using CNNs include [4, 5, 6]. All these methods, however, are trained on independent, non-sequential frames and do not incorporate any temporal information, which can potentially facilitate segmentation in cases where neighboring cells are hard to separate or where a cell partially vanishes. The use of temporal information, by combining tracking information from individual cells to support segmentation decisions, has been shown to improve results for non-deep-learning methods [7, 8, 9, 10], but has not yet been extensively examined in deep learning approaches.
A Recurrent Neural Network (RNN) is an artificial neural network equipped with feedback connections. This architecture makes it suitable for the analysis of dynamic behavior. A special variant of RNNs is the Long Short-Term Memory (LSTM) [11], which includes an internal memory state vector with gating operations, thus stabilizing the training process. Common LSTM-based applications include natural language processing (NLP) [12], audio processing [13], and image captioning [14].
Convolutional LSTMs (C-LSTMs) accommodate local spatial information in image sequences by replacing matrix multiplication with convolutions [15]. The C-LSTM has recently been used for the analysis of both temporal image sequences, such as next frame prediction [16], and volumetric data sets [17, 18]. In [18], C-LSTM is applied in multiple directions for the segmentation of 3D data represented as a stack of 2D slices. Another approach for 3D brain structure segmentation is proposed in [17], where each slice is separately fed into a U-Net architecture, and only the output is then fed into bi-directional C-LSTMs.
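To make the replacement of matrix multiplication by convolution concrete, a commonly used form of the C-LSTM update is sketched below. The notation is an assumption made for illustration: $x_t$ is the input feature map, $h_t$ and $c_t$ are the output and memory cell, $*$ denotes convolution, $\odot$ element-wise multiplication, and $W$, $b$ are learned kernels and biases. This is the variant without peephole connections; the original C-LSTM formulation additionally includes peephole terms on the memory cell.

```latex
\begin{align}
i_t &= \sigma\left(W_{xi} * x_t + W_{hi} * h_{t-1} + b_i\right) \\
f_t &= \sigma\left(W_{xf} * x_t + W_{hf} * h_{t-1} + b_f\right) \\
o_t &= \sigma\left(W_{xo} * x_t + W_{ho} * h_{t-1} + b_o\right) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tanh\left(W_{xc} * x_t + W_{hc} * h_{t-1} + b_c\right) \\
h_t &= o_t \odot \tanh\left(c_t\right)
\end{align}
```

Because every gate is computed by convolution, $h_t$ and $c_t$ keep the spatial layout of the input, which is what allows such a layer to stand in for a convolutional layer inside an encoder-decoder network.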
In this paper we introduce the integration of C-LSTMs into an encoder-decoder structure (U-Net), allowing compact spatio-temporal representations at multiple scales. We note that, unlike [17], which was designed and evaluated on 3D brain segmentation, the proposed novel architecture is an intertwined composition of the two concepts rather than a pipeline. Furthermore, since our method is designed for the segmentation of image sequences, which can be very long, the bi-directional C-LSTM is not computationally feasible. Our framework is assessed using time-lapse microscopy data, where both the cells’ dynamics and their spatial properties should be considered. Specifically, we tested our method on the Cell Tracking Challenge: http://www.celltrackingchallenge.net. Our method was ranked in the top three by the challenge organizers on several of the submitted data sets, specifically on the fluorescent simulated dataset (Fluo-N2DH-SIM+) and the differential interference contrast (DIC-C2DL-HeLa) sequences, which are difficult to segment.
The rest of the paper is organized as follows. Section 2 presents a probabilistic formulation of the problem and elaborates on the proposed network. Technical aspects are detailed in Section 3. In Section 4 we demonstrate the strength of our method, presenting state-of-the-art cell segmentation results. We conclude in Section 5.
2.1 Network Architecture
The proposed network incorporates C-LSTM [15] blocks into the U-Net [6] architecture. This combination, as suggested here, is shown to be powerful. The U-Net architecture, built as an encoder-decoder with skip connections, enables the extraction of meaningful descriptors at multiple image scales. However, this alone does not account for the cell-specific dynamics that can significantly support the segmentation. The introduction of C-LSTM blocks into the network allows past cell appearances to be considered at multiple scales by holding their compact representations in the C-LSTM memory units.
We propose here the incorporation of C-LSTM layers in every scale of the encoder section of the U-Net. Applying the C-LSTM on multiple scales is essential for cell microscopy sequences (as opposed to brain slices as in [17]), since the frame-to-frame differences might be at different scales, depending on the cells’ dynamics. Moreover, in contrast to brain volume segmentation [17], a microscopy sequence can be of arbitrary length, making the use of bi-directional LSTMs computationally impractical; in addition, the cells can move at different speeds and the changes are not normally smooth.
The comparison to other alternatives is presented in Section 4.2.
The network is fully convolutional and, therefore, can be used with any image size during both training and testing (in order to avoid artefacts, it is preferable to use image sizes which are multiples of eight, due to the three max-pooling layers). Figure 1 illustrates the network architecture detailed in Section 3.
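The preference for image sizes that are multiples of eight can be illustrated with a small size calculation. The sketch below assumes that each of the three max-pooling layers halves the spatial size with floor rounding and that the decoder doubles it back; the function names are hypothetical and are not taken from the authors' implementation.

```python
def pooled_sizes(size, num_pools=3, factor=2):
    """Spatial size after each of `num_pools` non-overlapping poolings."""
    sizes = [size]
    for _ in range(num_pools):
        sizes.append(sizes[-1] // factor)  # floor division drops odd remainders
    return sizes


def round_trip_size(size, num_pools=3, factor=2):
    """Size obtained by pooling `num_pools` times and up-sampling back."""
    return pooled_sizes(size, num_pools, factor)[-1] * factor ** num_pools


def pad_to_multiple(size, num_pools=3, factor=2):
    """Smallest size >= `size` divisible by factor**num_pools (8 by default)."""
    multiple = factor ** num_pools
    return -(-size // multiple) * multiple  # ceiling division


print(pooled_sizes(500))     # [500, 250, 125, 62]
print(round_trip_size(500))  # 496: no longer matches the 500-pixel input
print(pad_to_multiple(500))  # 504: survives the round trip unchanged
```

Under these assumptions, a 500-pixel side comes back from the decoder as 496 pixels, so the skip connections no longer align without cropping or padding, whereas a side of 504 pixels is preserved exactly.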
We address individual cells’ segmentation from microscopy sequences. The main challenge in this type of problem is not only foreground-background classification but also the separation of adjacent cells. We adopt the weighted distance loss as suggested by [6]. The loss is designed to enhance individual cells’ delineation by a partitioning of the dimensional (2 or 3) image domain into two classes: foreground and background, such that pixels which are near the boundaries of two adjacent cells are given higher importance. We set to denote these classes, respectively. Let be the input image sequence of length , where is a grayscale image. The network is composed of two sections of blocks each, the encoder recurrent block and the decoder block where are the network’s parameters. The input to the C-LSTM encoder layer at time includes the down-sampled output of the previous layer, the output of the current layer at the previous time-step and the C-LSTM memory cell. We denote these three inputs as , , respectively. Formally we define:
The inputs to the decoder layers are the up-sampled (using bi-linear interpolation) output of the previous layer and the output of the corresponding layer from the encoder denoted by and respectively. We denote the decoder output as . Formally,
We define a network with parameters as the composition of encoder blocks followed by decoder blocks, and denote . Note that the encoder blocks, , encode high-level spatio-temporal features at multiple scales and the decoder blocks, , refine that information into a full-scale segmentation map.
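The encoder and decoder recursions described above can be written compactly as follows. This is a sketch in assumed notation, not necessarily the symbols of the original formulation: $I_t$ is the frame at time $t$, $L$ is the number of blocks per section, $E^{(l)}$ and $D^{(l)}$ are the encoder and decoder blocks, $x_t^{(l)}$, $h_t^{(l)}$ and $c_t^{(l)}$ are the input, output and memory cell of encoder layer $l$, $z_t^{(l)}$ is the output of decoder layer $l$, $\downarrow$ denotes max-pooling, $\uparrow$ denotes bi-linear up-sampling, and $s(l)$ denotes the encoder layer whose scale matches decoder layer $l$. The exact pairing $s(l)$ and the initialization $z_t^{(L-1)} = h_t^{(L-1)}$ are assumptions.

```latex
\begin{align}
x_t^{(l)} &=
  \begin{cases}
    I_t, & l = 0 \\
    \downarrow h_t^{(l-1)}, & 0 < l < L
  \end{cases} \\
\left(h_t^{(l)},\, c_t^{(l)}\right) &=
  E^{(l)}\!\left(x_t^{(l)},\, h_{t-1}^{(l)},\, c_{t-1}^{(l)}\right),
  \qquad l = 0, \dots, L-1 \\
z_t^{(l)} &=
  D^{(l)}\!\left(\uparrow z_t^{(l-1)},\, h_t^{(s(l))}\right),
  \qquad l = L, \dots, 2L-1
\end{align}
```

In this form the temporal recursion appears only in the encoder, through $h_{t-1}^{(l)}$ and $c_{t-1}^{(l)}$, while the decoder at time $t$ depends on the past only via the encoder outputs it receives.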
We set the final output as a -dimensional feature vector corresponding to each input pixel. We define the segmentation as the pixel label probabilities using the softmax equation:
The final segmentation is defined as follows:
Each connected component of the foreground class is given a unique label and is considered an individual cell.
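The final steps above (per-pixel softmax, selection of the most probable class, and labeling of connected foreground components) can be sketched in a few lines of dependency-free Python. The class order (index 0 for background, index 1 for foreground) and the use of 4-connectivity are assumptions made for this illustration; the function names are hypothetical.

```python
import math
from collections import deque


def softmax(scores):
    """Numerically stable softmax over a list of class scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def segment(score_map):
    """Assign each pixel the class with the highest softmax probability."""
    mask = []
    for row in score_map:
        mask_row = []
        for scores in row:
            probs = softmax(scores)
            mask_row.append(probs.index(max(probs)))
        mask.append(mask_row)
    return mask


def label_components(mask):
    """Give every 4-connected foreground component a unique positive label."""
    h, w = len(mask), len(mask[0])
    labels = [[0] * w for _ in range(h)]
    count = 0
    for y in range(h):
        for x in range(w):
            if mask[y][x] != 1 or labels[y][x] != 0:
                continue
            count += 1
            labels[y][x] = count
            queue = deque([(y, x)])
            while queue:  # breadth-first flood fill of one component
                cy, cx = queue.popleft()
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] == 1 and labels[ny][nx] == 0:
                        labels[ny][nx] = count
                        queue.append((ny, nx))
    return labels, count


# Toy 2x3 output with [background, foreground] scores per pixel.
scores = [
    [[2.0, 0.1], [0.2, 1.5], [3.0, 0.0]],
    [[0.1, 2.0], [1.0, 0.2], [0.0, 4.0]],
]
mask = segment(scores)
labels, num_cells = label_components(mask)
print(mask)       # [[0, 1, 0], [1, 0, 1]]
print(labels)     # [[0, 1, 0], [2, 0, 3]]
print(num_cells)  # 3
```

With 4-connectivity the three diagonal foreground pixels form three separate cells; under 8-connectivity they would merge into one, so the connectivity choice matters for touching cells.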
2.3 Training and Loss
During the training phase the network is presented with a full sequence and manual annotations , where are the ground truth (GT) labels. The network is trained using Truncated Back Propagation Through Time (TBPTT) [19]. At each back propagation step the network is unrolled to time-steps. The loss is defined using the distance weighted cross-entropy loss as proposed in the original U-Net paper [6]. The loss imposes separation of cells by introducing an exponential penalty factor which decays with the distance of a pixel from its nearest and second nearest cells’ pixels. Consequently, pixels which are located between two adjacent cells are given significant importance whereas pixels further away from the cells have a minor effect on the loss. A detailed discussion on the weighted loss can be found in the original U-Net paper [6].
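As an illustration of the distance weighted cross-entropy, the sketch below follows the weight map of the original U-Net paper, w(x) = w_c(x) + w_0 exp(-(d_1(x) + d_2(x))^2 / (2 sigma^2)), where d_1 and d_2 are the distances to the nearest and second nearest cell. Several points are assumptions for this example: the class-balancing term w_c is replaced by a constant, the border term is applied to background pixels only, distances are computed by brute force, and the defaults w_0 = 10 and sigma = 5 are the values suggested in the U-Net paper, which are not necessarily those used in this work.

```python
import math


def border_weight(d1, d2, w0=10.0, sigma=5.0):
    """Exponential term that is large when both nearest cells are close."""
    return w0 * math.exp(-((d1 + d2) ** 2) / (2.0 * sigma ** 2))


def weight_map(labels, w0=10.0, sigma=5.0, class_weight=1.0):
    """Per-pixel loss weights for an instance label image.

    labels[y][x] is 0 for background and k > 0 for pixels of cell k.
    """
    cells = {}
    for y, row in enumerate(labels):
        for x, k in enumerate(row):
            if k > 0:
                cells.setdefault(k, []).append((y, x))

    weights = []
    for y, row in enumerate(labels):
        weight_row = []
        for x, k in enumerate(row):
            w = class_weight
            if k == 0 and len(cells) >= 2:
                # Distance from this background pixel to every cell.
                dists = sorted(
                    min(math.hypot(y - cy, x - cx) for cy, cx in pixels)
                    for pixels in cells.values()
                )
                w += border_weight(dists[0], dists[1], w0, sigma)
            weight_row.append(w)
        weights.append(weight_row)
    return weights


def weighted_cross_entropy(probs, targets, weights):
    """Mean of -w(x) * log p_target(x) over all pixels."""
    total, count = 0.0, 0
    for prob_row, target_row, weight_row in zip(probs, targets, weights):
        for p, t, w in zip(prob_row, target_row, weight_row):
            total -= w * math.log(p[t])
            count += 1
    return total / count


near = weight_map([[1, 0, 2]])                          # one-pixel gap
far = weight_map([[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2]])   # nine-pixel gap
print(near[0][1] > far[0][5])  # True: narrow gaps are weighted more heavily
```

The weight of a background pixel falls off quickly as the gap between two cells widens, which is what pushes the network to preserve thin separating borders between adjacent cells.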
3 Implementation Details
The network comprises encoder and decoder blocks. Each block in the encoder section is composed of a C-LSTM layer, leaky ReLU, a convolutional layer, batch normalization, and leaky ReLU, and is finally down-sampled using a max-pool operation. The decoder blocks consist of a bi-linear interpolation and a concatenation with the parallel encoder block, followed by two convolutional layers, batch normalization [20], and leaky ReLU. All convolutional layers use kernel size with layer depths . All maxpool layers use kernel size without overlap. All C-LSTM kernels are of size and respectively with layer depths . The last convolutional layer uses kernel size with depth followed by a softmax layer to produce the final probabilities (see Figure 1).
3.2 Training Regime
The images were annotated using two labels for the background and cell nucleus. In order to increase the variability, the data was randomly augmented spatially and temporally by: 1) random horizontal and vertical flip, 2) random rotation, 3) random crop of size , 4) random sequence reverse (), 5) random temporal down-sampling by a factor of , 6) random affine and elastic transformations. We note that the gray-scale values are not augmented as they are biologically meaningful.
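A subset of these augmentations (flips, sequence reversal, and temporal down-sampling) can be sketched as follows. The key point is that one random draw is shared by the image sequence and its annotations so that they stay aligned, and that only the geometry and frame order change while gray-scale values are left untouched. The function name, the 0.5 probabilities, and the `max_stride` parameter are assumptions for illustration; rotation, cropping, and affine or elastic transformations are omitted.

```python
import random


def augment_sequence(frames, labels, rng, max_stride=2):
    """Apply the same random flips, reversal and temporal stride to both sequences.

    frames and labels are lists of 2D images given as lists of rows.
    """
    if len(frames) != len(labels):
        raise ValueError("frames and labels must have the same length")

    # Draw the random parameters once so images and annotations stay aligned.
    flip_h = rng.random() < 0.5
    flip_v = rng.random() < 0.5
    reverse = rng.random() < 0.5
    stride = rng.randint(1, max_stride)  # inclusive on both ends

    def transform(sequence):
        ordered = sequence[::-1] if reverse else list(sequence)
        ordered = ordered[::stride]  # temporal down-sampling
        out = []
        for image in ordered:
            rows = image[::-1] if flip_v else image
            out.append([row[::-1] if flip_h else list(row) for row in rows])
        return out

    return transform(frames), transform(labels)


rng = random.Random(0)
frames = [[[10 * t + 1, 10 * t + 2], [10 * t + 3, 10 * t + 4]] for t in range(6)]
labels = [[[0, 1], [1, 0]] for _ in range(6)]
aug_frames, aug_labels = augment_sequence(frames, labels, rng)
print(len(aug_frames) == len(aug_labels))  # True
```

Passing an explicit `random.Random` instance keeps the augmentation reproducible, which is convenient when comparing architecture variants on identical inputs.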
4 Experiments and Results
Comparison of three variants of the proposed network incorporating C-LSTMs in: (1) the encoder section (EncLSTM), (2) the decoder section (DecLSTM), (3) both the encoder and decoder sections (FullLSTM). The training procedure was repeated three times; mean and standard deviation are presented.
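Table 2: Cell Tracking Challenge results of EncLSTM and DecLSTM (both submitted as BGU-IL) in comparison to the three leading methods, as provided by the challenge organizers.
| Dataset | EncLSTM (BGU-IL) | DecLSTM (BGU-IL) | First | Second | Third |
|---|---|---|---|---|---|
| Fluo-N2DH-SIM+ | 0.811 (1st) | 0.802 (3rd) | 0.811 | 0.807 | 0.802 |
| DIC-C2DL-HeLa | 0.793 (2nd) | 0.511 (5th) | 0.814 | 0.793 | 0.792 |
| Fluo-N2DH-GOWT1 | 0.850 (8th) | 0.854 (7th) | 0.927 | 0.894 | 0.893 |
| Fluo-N2DL-HeLa | 0.811 (8th) | 0.839 (6th) | 0.903 | 0.902 | 0.900 |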
4.1 Evaluation Method
The method was evaluated using the scheme proposed in the online version of the Cell Tracking Challenge. Specifically, the SEG measure was used for segmentation [22]. The SEG measure is defined as the mean Jaccard index of a pair of ground truth label and its corresponding segmentation . A segmentation is considered a match if .
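The SEG computation can be sketched as follows, with each object represented as a set of pixel coordinates. The matching rule used here, that a segmented object S matches a ground truth object R when it covers more than half of R (|R ∩ S| > 0.5 |R|), and the convention that unmatched ground truth objects score zero, reflect our understanding of the Cell Tracking Challenge definition and should be read as assumptions; the function names are hypothetical.

```python
def jaccard(a, b):
    """Jaccard index |a & b| / |a | b| of two pixel sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def seg_measure(gt_objects, seg_objects):
    """Mean Jaccard index over ground truth objects.

    A ground truth object R is matched to a segmented object S only if
    S covers more than half of R; unmatched objects contribute 0.
    """
    if not gt_objects:
        raise ValueError("at least one ground truth object is required")
    scores = []
    for r in gt_objects:
        matches = [jaccard(r, s) for s in seg_objects if len(r & s) > 0.5 * len(r)]
        scores.append(max(matches) if matches else 0.0)
    return sum(scores) / len(scores)


gt = [{(0, 0), (0, 1), (0, 2), (0, 3)}, {(5, 5), (5, 6)}]
seg = [{(0, 1), (0, 2), (0, 3), (0, 4)}, {(5, 5)}]
print(seg_measure(gt, seg))  # 0.3
```

In the toy example the first cell is matched with a Jaccard index of 3/5, while the second is covered by exactly half of its pixels, which does not satisfy the strict inequality and therefore counts as a miss.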
4.2 Architecture Selection
We propose integrating the C-LSTM into the U-Net by substituting the convolutional layers of the encoder section with C-LSTM layers (referred to as EncLSTM). In this section we compare this architecture with two alternatives, obtained by substituting: 1) the convolutional layers of the decoder section (referred to as DecLSTM); 2) the convolutional layers of both the decoder and encoder sections (referred to as FullLSTM). All three networks were trained simultaneously with identical inputs. Due to the limited size of the training set, the networks were trained on the Fluo-N2DH-SIM+ datasets and tested on the similar Fluo-N2DH-GOWT1 datasets from the training set of the Cell Tracking Challenge. The results presented in Table 1 show an advantage for the proposed architecture. However, the dominance of the EncLSTM with respect to the DecLSTM is not conclusive, as is demonstrated by the results for the Cell Tracking Challenge discussed next. We further note that a comparison to the original U-Net, without LSTM, is obtained in the challenge results and is referred to by the challenge organizers as FR-Ro-GE. The method is labelled in Table 2 with the superscript .
4.3 Cell Tracking Challenge Results
Two variants of the method, EncLSTM and DecLSTM, were applied to five data sets: Fluo-N2DH-SIM+, DIC-C2DL-HeLa, PhC-C2DH-U373, Fluo-N2DH-GOWT1, and Fluo-N2DL-HeLa. The results were submitted to the Cell Tracking Challenge. The proposed EncLSTM and DecLSTM were ranked 1st and 3rd, respectively, out of 20, for the Fluo-N2DH-SIM+ data set, and EncLSTM was ranked 2nd out of 10 for the DIC-C2DL-HeLa data set. We note that for the other three data sets the training data were significantly smaller, which might explain the inferior results we obtained. A possible solution to this problem is using an adversarial loss, as suggested in [4]. In total, 20 different methods have been submitted to the challenge, including the original U-Net (FR-Ro-GE) [6] and TUG-AT [23]. The latter independently and simultaneously proposed to utilize the U-Net architecture while introducing C-LSTM layers on the skip connections. Table 2 reports our results in comparison to the three leading methods (including ours) provided by the challenge organizers. Visualizations of the results are presented in Figure 2 and at https://youtu.be/IHULAZBmoIM. The quantitative results for the top three leading methods are also publicly available at the Cell Tracking Challenge web site.
Time-lapse microscopy cell segmentation is, inherently, a spatio-temporal task. Human annotators frequently rely on temporal cues in order to accurately separate neighbouring cells and detect partially visible cells. In this work, we demonstrate the strength of integrating temporal analysis, in the form of C-LSTMs, into a well-established network architecture (U-Net) and examine several alternative combinations. The resulting novel architecture is able to extract meaningful features at multiple scales and propagate them through time. This enables the network to accurately segment cells in difficult scenarios where the temporal cues are crucial. Quantitative analysis shows that our method achieves state-of-the-art results (Table 2), ranking 1st and 2nd place in the Cell Tracking Challenge (based on the October 15th, 2018 ranking). Moreover, the results reported in Table 1 demonstrate the proposed network’s ability to generalize from simulated training data to real data. This may imply that one can reduce and even eliminate the need for extensive manually annotated data. We further plan on incorporating an adversarial loss to weaken the dependency on training set size, as in [4].
References
[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in NIPS, 2012, pp. 1097–1105.
[2] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in Proceedings of CVPR, 2016, pp. 779–788.
[3] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE CVPR, 2015, pp. 3431–3440.
[4] A. Arbelle and T. Riklin Raviv, “Microscopy cell segmentation via adversarial neural networks,” arXiv preprint arXiv:1709.05860, 2017.
[5] O. Z. Kraus, J. L. Ba, and B. J. Frey, “Classifying and segmenting microscopy images with deep multiple instance learning,” Bioinformatics, vol. 32, no. 12, pp. i52–i59, 2016.
[6] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” arXiv preprint arXiv:1505.04597, 2015.
[7] F. Amat, W. Lemon, D. P. Mossing, K. McDole, Y. Wan, K. Branson, E. W. Myers, and P. J. Keller, “Fast, accurate reconstruction of cell lineages from large-scale fluorescence microscopy data,” Nature methods, 2014.
[8] A. Arbelle, N. Drayman, M. Bray, U. Alon, A. Carpenter, and T. Riklin-Raviv, “Analysis of high throughput microscopy videos: Catching up with cell dynamics,” in MICCAI 2015, pp. 218–225. Springer, 2015.
[9] M. Schiegg, P. Hanslovsky, C. Haubold, U. Koethe, L. Hufnagel, and F. A. Hamprecht, “Graphical model for joint segmentation and tracking of multiple dividing cells,” Bioinformatics, vol. 31, no. 6, pp. 948–956, 2014.
[10] A. Arbelle, J. Reyes, J.-Y. Chen, G. Lahav, and T. Riklin Raviv, “A probabilistic approach to joint cell tracking and segmentation in high-throughput microscopy videos,” Medical image analysis, vol. 47, pp. 140–152, 2018.
[11] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[12] O. Vinyals, Ł. Kaiser, T. Koo, S. Petrov, I. Sutskever, and G. Hinton, “Grammar as a foreign language,” in Advances in Neural Information Processing Systems, 2015, pp. 2773–2781.
[13] H. Sak, A. Senior, and F. Beaufays, “Long short-term memory recurrent neural network architectures for large scale acoustic modeling,” in Fifteenth annual conference of the international speech communication association, 2014.
[14] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio, “Show, attend and tell: Neural image caption generation with visual attention,” in International Conference on Machine Learning, 2015, pp. 2048–2057.
[15] S. Xingjian, Z. Chen, H. Wang, D.-Y. Yeung, W.-K. Wong, and W.-c. Woo, “Convolutional lstm network: A machine learning approach for precipitation nowcasting,” in Advances in neural information processing systems, 2015, pp. 802–810.
[16] W. Lotter, G. Kreiman, and D. Cox, “Deep predictive coding networks for video prediction and unsupervised learning,” arXiv preprint arXiv:1605.08104, 2016.
[17] J. Chen, L. Yang, Y. Zhang, M. Alber, and D. Z. Chen, “Combining fully convolutional and recurrent neural networks for 3d biomedical image segmentation,” in Advances in Neural Information Processing Systems, 2016, pp. 3036–3044.
[18] M. F. Stollenga, W. Byeon, M. Liwicki, and J. Schmidhuber, “Parallel multi-dimensional lstm, with application to fast biomedical volumetric image segmentation,” in Advances in neural information processing systems, 2015, pp. 2998–3006.
[19] R. J. Williams and J. Peng, “An efficient gradient-based algorithm for on-line training of recurrent network trajectories,” Neural computation, vol. 2, no. 4, pp. 490–501, 1990.
[20] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in International conference on machine learning, 2015, pp. 448–456.
[21] G. Hinton, “RMSProp: Coursera lecture slides, lecture 6.”
[22] V. Ulman, et al., “An objective comparison of cell-tracking algorithms,” Nature methods, vol. 14, no. 12, pp. 1141, 2017.
[23] C. Payer, D. Štern, T. Neff, H. Bischof, and M. Urschler, “Instance segmentation and tracking with cosine embeddings and recurrent hourglass networks,” arXiv preprint arXiv:1806.02070, 2018.