End-to-end Learning of Convolutional Neural Net and Dynamic Programming for Left Ventricle Segmentation using Synthetic Gradient
Differentiable programming combines different functions or programs in a processing pipeline with the goal of applying end-to-end learning or optimization. A significant impediment is the non-differentiable nature of some algorithms. We propose to use synthetic gradients (SG) to overcome this difficulty. SG uses the universal function approximation property of neural networks. We apply SG to combine a convolutional neural network (CNN) with dynamic programming (DP) in end-to-end learning for segmenting the left ventricle from the short axis view of heart MRI. Our experiments show that the end-to-end combination of CNN and DP requires fewer labeled images to achieve significantly better segmentation accuracy than using only a CNN.
Recent progress in medical image analysis is undoubtedly boosted by deep learning [10, 14]. Progress is observed in several medical image analysis tasks, such as segmentation [6, 19], registration, tracking and detection. One of the significant challenges in applying deep learning to medical image analysis is the limited amount of labeled data.
Our contribution in this paper is twofold. First, we demonstrate that a combination of deep learning and a traditional method with strong prior knowledge can compensate significantly for an inadequate amount of training data. We use differentiable programming (i.e., end-to-end learning) for combining different methods.
Our second contribution is a recommendation for combining a non-differentiable function with deep learning within an end-to-end learning framework. This is an extension of differentiable programming, which has been applied only to differentiable functions so far. In this context, we demonstrate the use of the universal function approximation property of neural networks, through the synthetic gradients (SG) technique, for a non-differentiable function. SG has been used before for fast and asynchronous training of differentiable functions.
We apply SG to combine CNN with dynamic programming (DP) and refer to this method as end-to-end DP with CNN (EDPCNN). As a significant test application, we use EDPCNN to segment the left ventricle from short axis heart MRI. Fig. 1 illustrates our processing pipeline. The input to the CNN (we use U-Net in our experiments) is an MR image, as shown in Fig. 2(a). The output from the CNN is a processed image, called the output map, on which a pattern is overlaid in Fig. 2(b). The pattern consists of a few graduated radial lines. We refer to it as a “star pattern.” The interpolator (“Interp” in Fig. 1) interpolates the output map at the points of the star pattern and warps the interpolated values into a matrix called the “Warped Map” in Fig. 1. Fig. 2(c) illustrates a Warped Map. DP minimizes a cost function on the Warped Map and chooses exactly one point on each radial line in the star pattern to output a set of indices in the warped domain, as shown in Fig. 2(d). Mapping the indices back to the image space gives us a closed contour as the final segmentation, as shown in Fig. 2(e). In comparison, the ground truth segmentation, created by an expert, is shown in Fig. 2(f).
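The interpolation step can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`star_pattern`, `warp_nearest`, `indices_to_contour`) are hypothetical, the radial lines are assumed to be evenly spaced in angle, and nearest-neighbor lookup is assumed as the interpolator.

```python
import numpy as np

def star_pattern(center, radius, num_lines, num_points):
    """Return (num_lines, num_points, 2) array of (row, col) sample points."""
    # Assumption: radial lines are evenly spaced over the full circle.
    angles = np.linspace(0.0, 2.0 * np.pi, num_lines, endpoint=False)
    radii = np.linspace(0.0, radius, num_points)
    rows = center[0] + radii[None, :] * np.sin(angles)[:, None]
    cols = center[1] + radii[None, :] * np.cos(angles)[:, None]
    return np.stack([rows, cols], axis=-1)

def warp_nearest(output_map, points):
    """Sample a 2D map at the star-pattern points (nearest neighbor)."""
    height, width = output_map.shape
    r = np.clip(np.rint(points[..., 0]).astype(int), 0, height - 1)
    c = np.clip(np.rint(points[..., 1]).astype(int), 0, width - 1)
    return output_map[r, c]  # shape (num_lines, num_points): the Warped Map

def indices_to_contour(points, indices):
    """Map one chosen index per radial line back to image coordinates."""
    return points[np.arange(points.shape[0]), indices]
```

Each row of the returned Warped Map corresponds to one radial line, so choosing one column index per row and passing the result to `indices_to_contour` yields the closed contour in image space.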
The EDPCNN pipeline is differentiable except for the argmin function calls inside the DP module, which render the entire pipeline unsuitable for end-to-end learning. For example, if there is a differentiable loss function that measures the error between the output contour and the ground truth contour, we would not be able to train the system end-to-end, because the gradient would not reliably flow back across the argmin function using the standard mechanisms of automatic differentiation. In the past, soft assignment has been utilized to mitigate the issue of non-differentiability for the argmin function. Here, we use SG to approximate the gradient of the loss with respect to the Warped Map, so that all the preceding differentiable layers (Interp and CNN) can apply standard backpropagation to learn trainable parameters. Fig. 1 illustrates that an approximating neural network (“Approx. Neural Network”) creates a differentiable bypass for the non-differentiable DP module. This second neural network approximates the contour that the DP module outputs. Then a differentiable loss function is applied between the ground truth contour and the output of the approximating neural network, making backpropagation possible with automatic differentiation. This mechanism is known as synthetic gradients, because the gradients of the approximating neural network serve as a proxy for the gradients of the DP module.
Fig. 3(a) shows an ablation study to demonstrate the effectiveness of combining CNN and DP in an end-to-end learning pipeline. The horizontal axis shows the number of training images and the vertical axis shows the Dice score of LV segmentation on a fixed validation set of images. Note that when the number of training images is small, EDPCNN performs significantly better than U-Net. As the training set grows, the gap between the Dice scores of U-Net and EDPCNN starts to close. However, EDPCNN maintains its superior performance over U-Net throughout. The Results section (Table 1) shows that the performance gain of EDPCNN over U-Net comes with only a modest increase (16%) in inference time.
Fig. 3(a) shows another experiment called “U-Net+DP”. In the U-Net+DP processing pipeline, DP is applied on the output of a trained U-Net without end-to-end training. Once again, EDPCNN shows significantly better performance than U-Net+DP for small training sets, demonstrating the effectiveness of end-to-end learning. We hypothesize that DP infuses strong prior knowledge into the training of U-Net within EDPCNN and that this prior knowledge acts as a regularizer to overcome some of the challenges associated with small training data.
The paper is organized as follows. In the next two sections, we discuss DP and SG. Then we discuss experiments on left ventricle segmentation, followed by a summary and future work. We also provide our code on GitHub.
The left ventricle appears as a “blob” object in short axis MRI. Traditionally, active contours and level set based methods were used for blob object segmentation. While these methods offer object shape constraints, they typically rely on strong edges or statistical modeling for successful segmentation. These techniques lack a way to work with labeled images in a supervised machine learning framework. For complex segmentation tasks, such as cardiac MRI segmentation, these methods are inadequate. Deep learning (DL) has renewed interest in these classic techniques in recent years, including our present work, because, starting from raw pixels, DL can be trained end-to-end with labeled images. With the exception of limited literature, such as shape prior CNN, DL lacks any inherent mechanism to incorporate prior knowledge about object shapes; instead, DL relies on the volume of labeled images to implicitly learn about object shapes or constraints. Hence, there is a need to combine CNN with these traditional methods so that the latter can provide adequate prior knowledge.
Hu et al. proposed to use CNN to learn a level set function (signed distance transform) for salient object detection. Tang et al. used level set in conjunction with deep learning to segment liver CT data and left ventricle from MRI. However, their method does not use end-to-end training for this combination. Deep active contours combined CNN and active contours; the work, however, fell short of an end-to-end training process.
Literature on combined end-to-end learning is not yet abundant. End-to-end learning employing level set and a deep learning-based object detector has been utilized in Le et al.’s work, where the authors modeled level set computation as a recurrent neural network. Marcos et al. have combined CNN and active contours in end-to-end training with a structured loss function.
The proposed EDPCNN is another addition to the growing repertoire combining CNN and active contours, with a noteworthy novelty. While all the aforementioned literature on segmentation combines differentiable components, in EDPCNN we demonstrate how to combine a DP-based active contour with CNN in an end-to-end fashion, where DP is non-differentiable.
Medical image analysis often has to deal with a limited amount of labeled / annotated images. DL has been most successful where plenty of data was annotated, e.g., diabetic retinopathy. Transfer learning is the dominant approach to deal with limited labeled data in medical image analysis. In transfer learning, a deep network is first trained on an unrelated but large dataset, such as ImageNet; then the trained model is fine-tuned on a smaller data set specific to the task. Transfer learning has been applied for lymph node detection and classification, localization of kidney and many other tasks. Data augmentation is also applied to deal with limited labeled data.
In this work, we present a complementary approach to work with a limited amount of labeled images. Our guiding principle is to inject the learning system with prior knowledge about the solution. A similar argument was made by Ngo, Lu, and Carneiro for combining level set and CNN to work with limited labeled data for left ventricle segmentation. For this segmentation task, the prior knowledge is a smooth shape, which can be modeled as a closed contour drawn through a star pattern. To inject such knowledge into the learning system, we resort to the principle of differentiable programming, where several differentiable algorithms are stitched together. However, the added difficulty in our case is the non-differentiable nature of DP, which we overcome using SG.
Use of DP in computer vision is wide ranging, including interactive object segmentation. Here, we use a previously described DP setup to delineate star-shaped/blob objects, a shape class that describes left ventricles in the short axis view well.
Let the star pattern have $N$ radial lines with $M$ points on each line. DP minimizes the following cost function:

$$\min_{v_1, \ldots, v_N} \sum_{n=1}^{N} E(n, v_n, v_{n \oplus 1}), \qquad (1)$$

where each variable $v_n$ is discrete and $v_n \in \{1, \ldots, M\}$. The cost component for radial line $n$ is $E(n, v_n, v_{n \oplus 1})$, and it is defined as follows:

$$E(n, i, j) = \begin{cases} g(n, i) - g(n, i-1) + g(n \oplus 1, j) - g(n \oplus 1, j-1), & \text{if } |i - j| \le \delta, \\ \infty, & \text{otherwise,} \end{cases} \qquad (2)$$

where $g$ is the Warped Map in the EDPCNN pipeline (Fig. 1), with $g(n, i)$ representing the value of the Warped Map on the $i$-th point of radial line $n$. The symbol $\oplus$ denotes a modulo-$N$ addition, so that $N \oplus 1 = 1$ and $n \oplus 1 = n + 1$ for $n < N$. The discrete variable $v_n$ represents the index of a point on radial line $n$. DP selects exactly one point on each radial line to minimize the directional derivatives of $g$ along the radial lines. The collection of indices chosen by DP forms a closed contour representing a delineated left ventricle. To maintain the continuity of the closed contour, (2) imposes a constraint: chosen points on two consecutive radial lines have to be within a distance $\delta$. In this fashion, DP acts as a blob object boundary detector maximizing edge contrast, while maintaining a continuity constraint.
Algorithm 1 implements DP; its calls to the argmin function are what make it non-differentiable. So, during end-to-end learning we cannot rely on the automatic differentiation of software packages. Algorithm 1 can be efficiently vectorized to accommodate image batches suitable for running on GPUs.
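Algorithm 1 is not reproduced here, so the following is only a hedged sketch of how a closed-contour DP of this kind could be written. It assumes the cost decomposes into a per-point term `unary[n, i]` (for instance a radial finite difference of the Warped Map) plus a hard constraint that the indices chosen on consecutive radial lines, including the last and the first, differ by at most `delta`. The function name `closed_contour_dp` is hypothetical. The cycle is broken by fixing the index on the first line and running a Viterbi-style pass for each choice.

```python
import numpy as np

def closed_contour_dp(unary, delta):
    """Minimize sum_n unary[n, v[n]] s.t. |v[n] - v[(n + 1) % N]| <= delta."""
    num_lines, num_points = unary.shape
    idx = np.arange(num_points)
    # penalty[i, j] is 0 when i and j may sit on consecutive lines, else inf.
    allowed = np.abs(idx[:, None] - idx[None, :]) <= delta
    penalty = np.where(allowed, 0.0, np.inf)

    best_cost, best_path = np.inf, None
    for start in range(num_points):  # fix v[0] to break the cycle
        cost = np.full(num_points, np.inf)
        cost[start] = unary[0, start]
        back = np.zeros((num_lines, num_points), dtype=int)
        for n in range(1, num_lines):
            # trans[i, j]: best cost of reaching j on line n via i on line n-1
            trans = cost[:, None] + penalty
            back[n] = np.argmin(trans, axis=0)  # the non-differentiable step
            cost = trans[back[n], idx] + unary[n]
        # Close the loop: the last index must be within delta of the start.
        cost = cost + penalty[:, start]
        end = int(np.argmin(cost))
        if cost[end] < best_cost:
            path = np.empty(num_lines, dtype=int)
            path[-1] = end
            for n in range(num_lines - 1, 0, -1):
                path[n - 1] = back[n, path[n]]
            best_cost, best_path = float(cost[end]), path
    return best_path, best_cost
```

Under these assumptions, a per-point cost could be derived from a Warped Map `g` with `np.diff(g, axis=1, prepend=g[:, :1])`. The `np.argmin` calls are the reason automatic differentiation cannot propagate a useful gradient through this module.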
SG uses the universal function approximation property of neural networks. SG can train deep neural networks asynchronously to yield faster training. In order to use SG in the EDPCNN processing pipeline, as before, let us first denote by $g$ the Warped Map, which is input to the DP module. Let $L(p, p_{gt})$ denote a differentiable loss function which evaluates the collection of indices $p$ output from the DP module against its ground truth $p_{gt}$, which can be obtained by taking the intersection between the ground truth segmentation mask and the radial lines of the star pattern. Let us also denote by $F$ a neural network, which takes $g$ as input and outputs a softmax function to mimic the output of DP. In Fig. 1, $F$ appears as “Approx. Neural Network.” Let $\phi$ and $\psi$ denote the trainable parameters of $F$ and U-Net, respectively. The inner minimization in the SG algorithm (Algorithm 2) trains the approximating neural network $F$, whereas the outer minimization trains U-Net. Both networks, being differentiable, are trained by backpropagation using automatic differentiation. The general idea here is to train $F$ to mimic the output indices $p$ of the DP module as closely as possible, then use $L(F(g), p_{gt})$ to approximate $L(p, p_{gt})$, bypassing the non-differentiable steps of DP entirely. Minimizing $L(p, p_{gt})$ then becomes minimizing $L(F(g), p_{gt})$ with this approximation.
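The loss function $L$ in this work is chosen to be the cross entropy between the output of the approximating network $F$ and the one-hot form of either the DP output indices $p$ or the ground truth indices $p_{gt}$. In this case, $F(g)$, the output of $F$ on the Warped Map $g$, comprises $N$ vectors (one per radial line), each of size $M$ (one entry per point on the line), representing the softmax output of the classification problem for selecting an index on each radial line.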
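As an illustration of this loss, the sketch below computes a cross entropy between per-line scores and integer target indices, which is equivalent to using one-hot targets. The function name `contour_cross_entropy`, the use of raw logits as input, and the averaging over radial lines are assumptions made for the example, not details taken from the paper.

```python
import numpy as np

def contour_cross_entropy(logits, target_indices):
    """Mean cross entropy over radial lines.

    logits: (num_lines, num_points) unnormalized scores, one row per line.
    target_indices: (num_lines,) index of the target point on each line.
    """
    # Numerically stable log-softmax along each radial line.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Picking the target entry equals a dot product with a one-hot vector.
    picked = log_probs[np.arange(logits.shape[0]), target_indices]
    return float(-picked.mean())
```

The same function can score the approximating network against either the DP output indices or the ground truth indices, since both are one index per radial line.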
We have observed that introducing randomness as a way of exploration in the inner loop, by adding noise $\epsilon$ to the Warped Map $g$, is important for the algorithm to succeed. Instead of minimizing $L(F(g), \mathrm{DP}(g))$, where $F$ is the approximating network and $L$ is the loss, we minimize $L(F(g + \epsilon), \mathrm{DP}(g + \epsilon))$. In comparison, the use of SG in asynchronous training did not have to resort to any such exploration mechanism. The correctness of the gradient provided by SG depends on how well $F$ fits the surface of the DP algorithm around $g$. We hypothesize that without sufficient exploration, $F$ will overfit to a few points on the surface and lead to an improper gradient signal. The noise hyperparameter can be set using cross validation, while the number of noise samples controls the trade-off between gradient accuracy and training time. We found settings of both that work well for our experiments.
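The exploration step can be sketched as generating perturbed training pairs for the approximating network. This is a minimal illustration under stated assumptions: the noise is taken to be zero-mean Gaussian with standard deviation `sigma`, the function name `noisy_dp_pairs` is hypothetical, and `dp_fn` stands in for any non-differentiable solver that maps a Warped Map to one index per radial line.

```python
import numpy as np

def noisy_dp_pairs(warped_map, dp_fn, sigma, num_samples, rng):
    """Build (noisy input, DP target) pairs around one Warped Map."""
    inputs, targets = [], []
    for _ in range(num_samples):
        # Assumption: isotropic Gaussian noise is used for exploration.
        noisy = warped_map + rng.normal(0.0, sigma, size=warped_map.shape)
        inputs.append(noisy)
        # The target is the solver's output on the *perturbed* input.
        targets.append(dp_fn(noisy))
    return np.stack(inputs), np.stack(targets)
```

For example, with a stand-in solver `dp_fn = lambda m: m.argmin(axis=1)` and `rng = np.random.default_rng(0)`, the returned arrays have shapes `(num_samples, N, M)` and `(num_samples, N)`. More samples cover the neighborhood of the Warped Map more densely at the cost of more solver calls per training step.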
We evaluate the performance of EDPCNN against U-Net on a modified ACDC dataset. As the test set is not publicly available, we split the original training set into a training set and a validation set, following prior work. As in that work, the images are re-sampled to a resolution of 212 × 212. As the original U-Net model does not use padded convolution, each image in the dataset has to be padded at the beginning, so that the final output has the same size as the original image. After these steps, we remove all images that do not contain the left ventricle class from the datasets, resulting in a training set of 1436 images and a validation set of 372 images.
We train U-Net and EDPCNN with training sample sizes increasing from 10 images to the full training set of 1436 images. To avoid ordering bias, we randomly shuffle the entire training set once, then choose training images from the beginning of the shuffled set, so that each smaller training set is contained in every bigger one, creating telescopic training sets suitable for an ablation study.
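The telescopic construction amounts to shuffling once and taking prefixes. A minimal sketch follows; the function name `telescopic_subsets`, the fixed seed, and the intermediate sizes in the usage note are illustrative assumptions.

```python
import random

def telescopic_subsets(items, sizes, seed=0):
    """Shuffle once, then return nested prefixes of the shuffled list."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    # Every smaller subset is a prefix of every larger one.
    return {size: shuffled[:size] for size in sorted(sizes)}
```

Calling `telescopic_subsets(range(1436), [10, 100, 1436])` returns three lists in which the 10-image set is the first 10 entries of the 100-image set, which in turn is a prefix of the full set.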
As the output contour of DP may sometimes be jagged, we employ a post-processing step where the output indices are smoothed by a fixed 1D moving average convolution filter with circular padding. The size of the convolutional filter is set using a heuristic to be around one-fourth the number of radial lines on the star pattern. This post-processing also has the effect of pushing the contour closer to a circle, which is also a good prior for the left ventricle. This step improves our validation accuracy by around 0.5 to 0.8 percent. Since SG mimics the post-processed output, post-processing is a part of the end-to-end processing.
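A circular moving average of this kind can be sketched in a few lines of NumPy. The function name `smooth_indices` is hypothetical, and the sketch returns fractional values without rounding, since the text does not say whether the smoothed indices are rounded.

```python
import numpy as np

def smooth_indices(indices, kernel_size):
    """Moving average over a closed contour (circular padding)."""
    values = np.asarray(indices, dtype=float)
    left = kernel_size // 2
    right = kernel_size - 1 - left
    # "wrap" padding makes the first and last radial lines neighbors.
    padded = np.pad(values, (left, right), mode="wrap")
    kernel = np.ones(kernel_size) / kernel_size
    return np.convolve(padded, kernel, mode="valid")
```

Following the heuristic in the text, `kernel_size` could be set to roughly a quarter of the number of radial lines, for example `max(1, num_lines // 4)`.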
For evaluation of a segmentation against its corresponding ground truth, we use the Dice score, a widely accepted metric for medical image segmentation. EDPCNN requires the star pattern to be available so that the output of U-Net can be interpolated on the star pattern to produce the Warped Map. The star pattern is fixed, but its center can be supplied by a user in interactive segmentation. For all our experiments, the ground truth left ventricle center for an image serves as the center of the star pattern for the same image. While by design EDPCNN outputs a single connected component, U-Net can produce any number of components without any control. Thus, to treat the evaluation of U-Net fairly against EDPCNN, in all the experiments we compute Dice scores within a square that tightly fits the star pattern. So, any connected component produced by U-Net outside of this square is discarded during Dice score computation.
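The restricted Dice computation can be illustrated as follows. This is a sketch under assumptions: the function name `dice_in_box` is hypothetical, the square is taken to be centered on the star-pattern center with a half side of `half_size` pixels, and two empty masks are scored as 1.0 by convention.

```python
import numpy as np

def dice_in_box(pred_mask, true_mask, center, half_size):
    """Dice score computed only inside a square around `center`."""
    r, c = center
    height, width = pred_mask.shape
    r0, r1 = max(r - half_size, 0), min(r + half_size + 1, height)
    c0, c1 = max(c - half_size, 0), min(c + half_size + 1, width)
    pred = pred_mask[r0:r1, c0:c1].astype(bool)
    true = true_mask[r0:r1, c0:c1].astype(bool)
    denom = pred.sum() + true.sum()
    if denom == 0:
        return 1.0  # assumption: two empty masks count as a perfect match
    return float(2.0 * np.logical_and(pred, true).sum() / denom)
```

A stray predicted component outside the square does not affect the score, which is the behavior the evaluation protocol above relies on.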
We train U-Net and EDPCNN using the Adam optimizer with a learning rate of 0.0001, chosen to make the training of U-Net stable. The training batch size is 10 for each iteration and the total number of iterations is 20000. Neither learning rate decay nor weight decay is used, because we have not found these helpful. We evaluate each method on the validation set after every 50 iterations and select the model with the highest validation Dice score.
For EDPCNN, we use the nearest neighbor method to interpolate the output of U-Net on the star pattern to compute the Warped Map. We choose the center of the star pattern for each image to be the center of mass of the object. To make the model more robust and improve generalization, during training we randomly jitter the center of the star pattern, with the requirement that the center still stays inside the object. Define the “object radius” as the distance from the center of mass of an object to its nearest point on the contour. We then sample the jittered center from a 2D truncated normal distribution with mean equal to the coordinates of the center of mass and standard deviation equal to the object radius. We find that this kind of jittering can improve the Dice score on smaller training sets by up to about 2%. We also randomly rotate the star pattern by an angle between -0.5 and 0.5 radian as an additional random exploration.
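One way to realize this jitter is rejection sampling, sketched below. Several details are assumptions rather than facts from the paper: the truncation is implemented by rejecting samples that fall outside the object mask, the contour is taken to be the object pixels that have a background 4-neighbor, and the names `jitter_center`, `scale`, and `max_tries` are hypothetical. The mask is assumed to be non-empty.

```python
import numpy as np

def jitter_center(mask, rng, scale=1.0, max_tries=100):
    """Sample a jittered star-pattern center that stays inside the object."""
    mask = np.asarray(mask, dtype=bool)
    rows, cols = np.nonzero(mask)
    center = np.array([rows.mean(), cols.mean()])  # center of mass
    # Contour pixels: object pixels with at least one background 4-neighbor.
    padded = np.pad(mask, 1, mode="constant", constant_values=False)
    interior = (padded[:-2, 1:-1] & padded[2:, 1:-1]
                & padded[1:-1, :-2] & padded[1:-1, 2:])
    b_rows, b_cols = np.nonzero(mask & ~interior)
    # Object radius: distance from the center of mass to the nearest contour pixel.
    radius = float(np.hypot(b_rows - center[0], b_cols - center[1]).min())
    for _ in range(max_tries):
        candidate = center + rng.normal(0.0, scale * radius, size=2)
        r, c = np.rint(candidate).astype(int)
        inside = 0 <= r < mask.shape[0] and 0 <= c < mask.shape[1]
        if inside and mask[r, c]:
            return candidate, radius  # accepted: still inside the object
    return center, radius  # fall back to the unjittered center
```

Setting `scale` below 1 narrows the jitter, which is convenient for robustness tests that jitter the center by a fraction of the object radius.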
The radius of the star pattern is chosen to be 65 so that all objects in the training set can be covered by the pattern after taking into account the random placement of the center during training. The number of points on a radial line is chosen to equal the radius of the star pattern, i.e., 65. The number of radial lines and the smoothness parameter are selected by grid search. We also find that the performance of our algorithm is quite robust to the choices of these hyperparameters: the Dice score drops by only around 3% even when their values are extreme. Lastly, for the inner minimization in Algorithm 2, we repeat the minimization step 10 times so that the approximating network fits well enough.
The approximating network used to mimic the output of DP has a U-Net-like architecture. As the Warped Map is smaller than the original image and likely less complex, the approximating network has only 3 encoder and 3 decoder blocks instead of the 4 encoder and 4 decoder blocks of U-Net. Additionally, we use padding for convolutions/transposed convolutions in the encoder/decoder blocks so that those layers keep the size of the feature maps unchanged, instead of doing a large padding at the beginning as in U-Net. This is purely for convenience. Note that these choices can be arbitrary as long as the approximating network can fit the surface of DP well enough. For the same reason, we find that the number of output channels in the first convolution of the approximating network is an important hyperparameter, because this value controls the capacity of the network and affects how well it fits the surface of DP. For comparison, the first convolution of U-Net has 64 output channels.
Supplying the target object center to EDPCNN can be perceived as a significant advantage. We argue that this advantage cannot overshadow the contribution of end-to-end learning. To establish this claim, we refer readers to Fig. 3(a) and note that the U-Net+DP model, despite having the same advantage, lags significantly behind EDPCNN. Therefore, end-to-end learning is the only attributable factor behind the success of EDPCNN.
Further, to test the robustness of EDPCNN with respect to the position of the star pattern center, we perform an experiment where the supplied center during testing is purposely jittered inside the object in the way it was done during training. Fig. 3(b) shows the effect of random jitter with the increase of jitter radius from no jitter to 0.5 of the object radius. We can see that there is no significant degradation in performance, especially for 0.2 jitter or below. Fig. 3(b) plots the average Dice scores for these experiments. In all the cases, the standard deviation of Dice scores remains small, below 0.01. Thus, the standard deviation has not been shown in Fig. 3(b).
Fig. 4 shows training iterations vs. Dice scores for training and validation sets. Two training sample sizes are shown: 10 and 1436 (full training set). For training sample size 10, the Dice scores on the validation set show significant variations and eventual overfitting for the U-Net model, while EDPCNN does not exhibit such a tendency. The absence of overfitting in EDPCNN is counterintuitive, because the learnable parameters in EDPCNN form a superset of those in U-Net. Our hypothesis is that a strong object model and prior knowledge infused by DP into U-Net prevent overfitting.
Table 1: Method | Time / training iteration | Total iterations | Total training time | Inference time / image
Finally, Table 1 shows running time for U-Net and EDPCNN. We observe that computationally EDPCNN is about 64% more expensive during training. However, test time for EDPCNN is only about 16% more than that of U-Net.
In this work, we illustrate how to combine convolutional neural networks and dynamic programming for end-to-end learning. Combination of CNN and traditional tools is not new; however, the novelty here is to handle a non-differentiable module, dynamic programming, within the end-to-end pipeline. We employ a neural network to approximate the gradient of the non-differentiable module. We found that the approximating neural network should have an exploration mechanism to be successful.
As a significant application, we choose left ventricle segmentation from short axis MRI. Our experiments show that the end-to-end combination is beneficial when the training data size is small. Our end-to-end model adds little computational overhead at inference time, making it a practical choice.
In the future, we plan to segment myocardium and right ventricle with automated placement of star patterns. For these and many other segmentation tasks in medical image analysis, strong object models given by traditional functional modules, such as dynamic programming, provide a way to cope with the lack of training data. Our presented method has the potential to become a blueprint to expand differentiable programming to include non-differentiable modules.
Bahdanau, D., Cho, K., Bengio, Y.: Neural Machine Translation by Jointly Learning to Align and Translate. ArXiv e-prints (Sep 2014)
Ghosal, S., Ray, N.: Deep deformable registration: Enhancing accuracy by fully convolutional neural net. Pattern Recognition Letters 94, 81–86 (2017). https://doi.org/10.1016/j.patrec.2017.05.022