Image Segmentation by Iterative Inference from Conditional Score Estimation

05/21/2017 ∙ by Adriana Romero, et al. ∙ Université de Montréal

Inspired by the combination of feedforward and iterative computations in the visual cortex, and taking advantage of the ability of denoising autoencoders to estimate the score of a joint distribution, we propose a novel approach to iterative inference for capturing and exploiting the complex joint distribution of output variables conditioned on some input variables. This approach is applied to pixel-wise image segmentation, with the estimated conditional score used to perform gradient ascent towards a mode of the estimated conditional distribution. This extends previous work on score estimation by denoising autoencoders to the case of a conditional distribution, with a novel use of a corrupted feedforward predictor replacing Gaussian corruption. An advantage of this approach over more classical ways to perform iterative inference for structured outputs, like conditional random fields (CRFs), is that it is no longer necessary to define an explicit energy function linking the output variables. To keep computations tractable, such energy function parametrizations are typically fairly constrained, involving only a few neighbors of each of the output variables in each clique. We experimentally find that the proposed iterative inference from conditional score estimation by conditional denoising autoencoders performs better than comparable models based on CRFs or those not using any explicit modeling of the conditional joint distribution of outputs.




1 Introduction

Based on response timing and propagation delays in the brain, a plausible hypothesis is that the visual cortex can perform fast feedforward inference (ThorpeFizeMarlot96, ) when an answer is needed quickly and the image interpretation is easy enough (requiring as little as 200ms of cortical propagation for some object recognition tasks, i.e., just enough time for a single feedforward pass), but needs more time and iterative inference in the case of more complex inputs (Vanmarcke-et-al-2016, ). Recent deep learning research and the success of residual networks (He2016DeepRL, ; GreffSS16, ) point towards a similar scenario (Liao2016BridgingTG, ): early computation is feedforward, a series of non-linear transformations which map low-level features to high-level ones, while later computation is iterative (using lateral and feedback connections in the brain) in order to capture complex dependencies between different elements of the interpretation. Indeed, whereas a purely feedforward network could model a unimodal posterior distribution (e.g., the expected target with some uncertainty around it), the joint conditional distribution of output variables given inputs could be more complex and multimodal. Iterative inference could then be used to either sample from this joint distribution or converge towards a dominant mode of that distribution, whereas a unimodal-output feedforward network would converge to some statistic like the conditional expectation, which might not correspond to a coherent configuration of the output variables when the actual conditional distribution is multimodal.

This paper proposes such an approach, combining a first feedforward phase with a second iterative phase that searches for a dominant mode of the conditional distribution, and applies it to the problem of semantic image segmentation. We take advantage of theoretical results (Alain+Bengio-ICLR2013, ) on denoising autoencoders (DAEs), which show that they can estimate the score, or negative gradient of the energy function, of the joint distribution of observed variables: the difference between the reconstruction and the input points in the direction of that estimated gradient. We propose to condition the autoencoder with an additional input so as to obtain the conditional score, i.e., the gradient of the energy of the conditional density of the output variables, given the input variables. The autoencoder takes a candidate output $y$ as well as an input $x$ and outputs a value $\hat{y}$ so that $\hat{y} - y$ estimates the direction $\frac{\partial \log p(y \mid x)}{\partial y}$. We can then take a gradient step in that direction and update $y$ towards a lower-energy value, iterating in order to approach a mode of the implicit $p(y \mid x)$ captured by the autoencoder. We find that instead of corrupting the segmentation target as input of the DAE, we obtain better results by training the DAE with the corrupted feedforward prediction, which is closer to what will be seen as the initial state of the iterative inference process. The use of a denoising autoencoder framework to estimate the gradient of the energy is an alternative to more traditional graphical modeling approaches, e.g., with conditional random fields (CRFs) (Lafferty01CRF, ; He04MCR, ), which have been used to model the joint distribution of pixel labels given an image (Koltun11, ).
The potential advantage of the DAE approach is that it is not necessary to decide on an explicitly parametrized energy function: such energy functions tend to only capture local interactions between neighboring pixels, whereas a convolutional DAE can potentially capture dependencies of any order and across the whole image, taking advantage of the state of the art in deep convolutional architectures in order to model these dependencies via the direct estimation of the energy function gradient. Note that this is different from the use of convolutional networks for the feedforward part of the network, and concerns the modeling of the conditional joint distribution of output pixel labels, given image features.

(a) Training scenario 1
(b) Training scenario 2
Figure 1: Training pipeline. Given an input image $x$, we extract a segmentation candidate $y$ and intermediate feature maps $h$ by applying a pre-trained segmentation network. We add some noise to $y$ and train a DAE that takes as input both the noisy $y$ and $h$ by minimizing Eq. 5. Training scenario 1 (a) yields the best results and uses the corrupted prediction as input to the DAE during training. Training scenario 2 (b) corresponds to the original DAE prescription in the conditional case, and uses a corruption of the ground truth as input to the DAE during training.

The main contributions of this paper are the following:

  1. A novel training framework for modeling structured output conditional distributions which is an alternative to CRFs, based on denoising autoencoder estimation of energy gradients.

  2. Showing how this framework can be used in an architecture for image pixelwise segmentation in which the above energy gradient estimator is used to propose a highly probable segmentation through gradient descent in the output space.

  3. Demonstrating that this approach to image segmentation outperforms or matches classical alternatives such as combining convolutional nets with CRFs and more recent state-of-the-art alternatives on the CamVid dataset.

2 Method

In this section, we describe the proposed iterative inference method to refine the segmentation of a feedforward network.

2.1 Background

As pointed out in Section 1, a DAE can estimate a density via an estimator of the score, or negative gradient of the energy function (Vincent2010SDA, ; Vincent-NC-2011, ; Alain+Bengio-ICLR2013, ). These theoretical analyses of DAEs are presented for the particular case where the corruption noise added to the input is Gaussian. Results show that a DAE can estimate the gradient of the energy function of a joint distribution of observed variables. The main result is the following:

$$\frac{r(x) - x}{\sigma^2} \approx -\frac{\partial E(x)}{\partial x}, \tag{1}$$

where $\sigma^2$ is the amount of Gaussian noise injected during training, $x$ is the input of the autoencoder and $r(x)$ is its output (the reconstruction). The approximation becomes exact as $\sigma \to 0$ and the autoencoder is given enough capacity, training examples and training time. The direction of $(r(x) - x)$ points towards more likely configurations of $x$. Therefore the DAE learns a vector field pointing towards the manifold where the input data lies.
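As a rough numerical illustration of this result, the sketch below fits the simplest possible denoiser, a linear map $r(\tilde{x}) = a\tilde{x}$, to one-dimensional Gaussian data. The data distribution, the linear form of the denoiser and all names are assumptions made for this example and are not part of the experiments. For data drawn from $\mathcal{N}(0, s^2)$ the score is $-x/s^2$, so $(r(x) - x)/\sigma^2$ should be close to it for small $\sigma$.

```python
import numpy as np

def fit_linear_denoiser(clean, sigma, rng):
    """Least-squares fit of r(x_noisy) = a * x_noisy to reconstruct clean samples."""
    noisy = clean + sigma * rng.standard_normal(clean.shape)
    # Closed-form minimiser of sum((a * noisy - clean)**2) over the scalar a.
    return float(np.dot(noisy, clean) / np.dot(noisy, noisy))

rng = np.random.default_rng(0)
s = 2.0        # standard deviation of the toy data (assumed)
sigma = 0.2    # corruption noise level (assumed)
clean = s * rng.standard_normal(500_000)

a = fit_linear_denoiser(clean, sigma, rng)

# Compare the denoiser-based score estimate with the true score at a few points.
x = np.array([-3.0, -1.0, 0.5, 2.0])
estimated_score = (a * x - x) / sigma**2
true_score = -x / s**2
```

In this toy setting the optimal coefficient is $a = s^2 / (s^2 + \sigma^2)$, so the estimate equals $-x / (s^2 + \sigma^2)$ and the gap to the true score vanishes as $\sigma \to 0$.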

2.2 Our framework

In our case, we seek to rapidly learn a vector field pointing towards more probable configurations of $y \mid x$. We propose to extend the results summarized in Subsection 2.1 and condition the autoencoder with an additional input. If we condition the autoencoder with features $h$, which are a function of $x$, the DAE framework with Gaussian corruption learns to estimate $-\frac{\partial E(y \mid x)}{\partial y}$, where $y$ is a segmentation candidate, $x$ an input image and $E$ is an energy function. Gradient descent in energy can thus be performed in order to iteratively reach a mode of the estimated conditional distribution:

$$y \leftarrow y - \epsilon \frac{\partial E(y \mid x)}{\partial y}, \tag{2}$$

with step size $\epsilon$. In addition, whereas Gaussian noise around the target $t$ would be the DAE prescription for the corrupted input to be mapped to $t$, this may be inefficient at visiting the configurations we really care about, i.e., those produced by our feedforward predictor, which we use to obtain a first guess for $y$, as initialization of the iterative inference towards an energy minimum. Therefore, we propose that during training, instead of using a corrupted $t$ as input, the DAE takes as input a corrupted segmentation candidate $\tilde{y}$ and either the input $x$ or some features $h$ extracted from a feedforward segmentation network applied to $x$:

$$h = f_l(x), \tag{3}$$

where $f_l$ is a non-linear function and $l$ is the index of a layer in the feedforward segmentation network. The output of the DAE is computed as

$$\hat{y} = r(\tilde{y}, h), \tag{4}$$

where $r$ is a non-linear function which is trained to denoise conditionally and $\tilde{y}$ is a corrupted form of $y$. During training, $\tilde{y}$ is $y$ plus noise, while at test time (for inference) it is simply $y$ itself.

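In order to train the DAE, (1) we extract both the segmentation candidate $y$ and the features $h$ from a feedforward segmentation network; (2) we corrupt $y$ into $\tilde{y}$; and (3) we train the DAE by minimizing the following loss

$$\mathcal{L} = \mathcal{H}(\hat{y}, t), \tag{5}$$

where $\hat{y}$ is the output of the DAE, $\mathcal{H}$ is the categorical cross-entropy and $t$ is the segmentation ground truth.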
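The corruption and loss steps above can be sketched in NumPy as follows. This is a minimal illustration under assumptions: `corrupt` and `categorical_cross_entropy` are hypothetical helper names, the segmentation candidate is stored as per-pixel class probabilities of shape (classes, height, width), the noise level is arbitrary, and the DAE itself is left out.

```python
import numpy as np

def corrupt(y, sigma, rng):
    """Step (2): add zero-mean Gaussian noise to the segmentation candidate y."""
    return y + sigma * rng.standard_normal(y.shape)

def categorical_cross_entropy(y_hat, t, eps=1e-12):
    """Step (3): mean per-pixel cross-entropy.

    y_hat: predicted class probabilities, shape (C, H, W).
    t:     integer ground-truth labels, shape (H, W).
    """
    h_idx, w_idx = np.indices(t.shape)
    picked = y_hat[t, h_idx, w_idx]   # probability assigned to the true class
    return float(-np.mean(np.log(picked + eps)))

rng = np.random.default_rng(0)
num_classes, height, width = 11, 4, 6
t = rng.integers(0, num_classes, size=(height, width))

# Stand-in for the feedforward prediction y: a softmax over random logits.
logits = rng.standard_normal((num_classes, height, width))
y = np.exp(logits) / np.exp(logits).sum(axis=0, keepdims=True)

y_tilde = corrupt(y, sigma=0.1, rng=rng)   # sigma chosen arbitrarily here
# A trained DAE would map (y_tilde, h) to y_hat; here y stands in for y_hat.
loss = categorical_cross_entropy(y, t)
```

In an actual training loop the loss would be evaluated on the DAE output and backpropagated through the DAE parameters only, since the segmentation network is pre-trained.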

Figure 1(a) depicts the pipeline during training. First, a fully convolutional feedforward network for segmentation is trained. In practice, we use one of the state-of-the-art pre-trained networks. Second, given an input image $x$, we extract a segmentation proposal $y$ and intermediate features $h$ from the segmentation network. Both $y$ and $h$ are fed to a DAE network (adding Gaussian noise to $y$). The DAE is trained to properly reconstruct the clean segmentation (ground truth $t$). Figure 1(b) presents the original DAE prescription, where the DAE is trained by taking as input the corrupted ground truth $\tilde{t}$ and $h$.

Once trained, we can exploit the trained model to iteratively take gradient steps in the direction of the segmentation manifold. To do so, we first obtain a segmentation proposal $y$ from the feedforward network and then we iteratively refine this proposal by applying the following rule

$$y \leftarrow y + \epsilon \left( r(y, h) - y \right). \tag{6}$$

For practical reasons, we collapsed the corruption noise $\sigma^2$ into the step size $\epsilon$.
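A minimal sketch of this refinement loop is given below, writing `y` for the current proposal and `h` for the feature maps. The callable `dae`, the function name `iterative_inference` and the toy stand-in used in the usage lines are assumptions for illustration; a real `dae` would be the trained network.

```python
import numpy as np

def iterative_inference(y0, h, dae, step_size, num_iterations):
    """Refine a proposal with y <- y + step_size * (dae(y, h) - y).

    No noise is added here, matching the test-time setting.
    """
    y = np.array(y0, dtype=float)
    for _ in range(num_iterations):
        y = y + step_size * (dae(y, h) - y)
    return y

# Toy stand-in for a trained DAE: it always reconstructs a fixed target.
target = np.array([0.0, 1.0, 0.0])
toy_dae = lambda y, h: target

y_refined = iterative_inference(
    y0=[0.4, 0.3, 0.3], h=None, dae=toy_dae, step_size=0.5, num_iterations=20
)
```

With the toy stand-in, each iteration halves the distance to the target, so `y_refined` ends up very close to it. A trained DAE returns a different reconstruction at every step, which is why the number of iterations has to be tuned.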

Figure 2 depicts the test pipeline. We start with an input image $x$ that we feed to a pre-trained segmentation network. The segmentation network outputs some intermediate feature maps $h$ and a segmentation proposal $y$. Then, both $y$ and $h$ are fed to the DAE to compute the output $\hat{y}$. The DAE is used to take iterative gradient steps towards the manifold of segmentation masks, with no noise added at inference time.

Figure 2: Test pipeline. Given an input image $x$, we extract a segmentation candidate $y$ and intermediate feature maps $h$ by applying a pre-trained segmentation network. We then feed $y$ and $h$ to the trained DAE and iteratively refine $y$ by applying Eq. 6. The final prediction is the last value of $y$ computed in this way.

3 Related Work

On one hand, recent advances in semantic segmentation mainly focus on improving the architecture design (ronneberger2015u, ; SegNet2015, ; DrozdzalVCKP16, ; Jegou17, ), increasing the context understanding capabilities (Gatta14-deepvision, ; VisinKCBMC15, ; chen14semantic, ; YuKoltun2016, ) and building processing modules to enforce structure consistency on segmentation outputs (Koltun11, ; chen14semantic, ; CRFasRNN, ). Here, we are interested in this last research direction. CRFs are among the most popular choices to impose structured information at the output of a segmentation network, with fully connected CRFs (Koltun11, ) and CRFs as RNNs (CRFasRNN, ) among the best performing variants. More recently, an alternative that promotes structure consistency by decomposing the prediction process into multiple steps and iteratively adding structure information was introduced in (li2016iterative, ). Another iterative approach was introduced in (GidarisK16a, ) to tackle image semantic segmentation by repeatedly detecting, replacing and refining segmentation masks. Finally, the reinterpretation of residual networks (Liao2016BridgingTG, ; GreffSS16, ) was exploited in (DrozdzalCVDTRBP17, ), in the context of biomedical image segmentation, by iteratively refining learned pre-normalized images to generate segmentation predictions.

On the other hand, there has recently been some research devoted to exploiting results on DAEs in different tasks, such as image generation (NguyenYBDC16, ), high resolution image estimation (Sonderby2017, ) and semantic segmentation (Xie2016, ). In (NguyenYBDC16, ), the authors propose plug & play generative networks, which, in the best reported results, train a fully-connected DAE to reconstruct a denoised version of some feature maps extracted from an image classification network. The iterative update rule at inference time is performed in the feature space. In (Sonderby2017, ), the authors use a DAE in the context of image super-resolution to learn the gradient of the density of high resolution images and apply it to refine the output of an upsampled low resolution image. In (Xie2016, ), the authors exploit convolutional pseudo-priors trained on the ground-truth labels in a semantic segmentation task. During the training phase, the pseudo-prior is combined with the segmentation proposal from a segmentation model to produce a joint distribution over data and labels. At test time, the ground truth is not accessible, thus the FCN predictions are fed iteratively to the convolutional pseudo-prior network. In this work, we exploit DAEs in the context of image segmentation and extend them in two ways: first by using them to learn a conditional score, and second by using a corrupted feedforward prediction as input during training to obtain better segmentation results.

4 Experiments

The main objective of these experiments is to answer the following questions:

  1. Can a conditional DAE be used successfully as the building block of iterative inference for image segmentation?

  2. Does our proposed corruption model (based on the feedforward prediction) work better than the prescribed target output corruption?

  3. Does the resulting segmentation system outperform more classical iterative approaches to segmentation such as CRFs?

4.1 CamVid Dataset

CamVid (camvid, ) is a fully annotated urban scene understanding dataset. It contains videos that are fully segmented. We used the same split and image resolution as (SegNet2015, ). The split contains 367 images (video frames) for training, 101 for validation and 233 for test. Each frame has a resolution of 360x480 and pixels are labeled with 11 different semantic classes.

4.2 Feedforward segmentation architecture

We experimented with two feedforward architectures for segmentation: the classical fully convolutional network FCN-8 of (Long2015fully, ) and the more recent state-of-the-art fully convolutional DenseNet (FC-DenseNet103) of (Jegou17, ), neither of which makes use of any additional synthetic data to boost its performance.

FCN-8 (Long2015fully, ): FCN-8 is a feedforward segmentation network, which consists of a convolutional downsampling path followed by a convolutional upsampling path. The downsampling path successively applies convolutional and pooling layers, and the upsampling path successively applies transposed convolutional layers. The upsampling path recovers spatial information by merging features skipped from the various resolution levels on the contracting path.

FC-DenseNet103 (Jegou17, ): FC-DenseNet is a state-of-the-art feedforward segmentation network, that exploits the feature reuse idea of (DenseNet2016, ) and extends it to perform semantic segmentation. FC-DenseNet103 consists of a convolutional downsampling path, followed by a convolutional upsampling path. The downsampling path iteratively concatenates all feature outputs in a feedforward fashion. The upsampling path applies a transposed convolution to feature maps from the previous stage and recovers information from higher resolution features from the downsampling path of the network by using skip connections.

4.3 DAE architecture

Our DAE is composed of a downsampling path and an upsampling path. The downsampling path contains convolutions and pooling operations, while the upsampling path is built from unpooling with switches (also known as unpooling with index tracking) (ZhaoMGL15, ; ZhangLL16, ; SegNet2015, ) and convolution operations. As discussed in (ZhangLL16, ), reverting the max pooling operations more faithfully significantly improves the quality of the reconstructed images. Moreover, while exploring potential network architectures, we found that using fully convolutional-like architectures with upsampling and skip connections (between downsampling and upsampling paths) degrades segmentation results when compared to unpooling with switches. This is not surprising, since we inject noise into the model’s input when training the DAE. Skip connections directly propagate this added noise to the end layers, making them responsible for the data denoising process. Note that the last layers of the model might not have enough capacity to accomplish the denoising task.

In our experiments, we use a DAE built from 6 interleaved pooling and convolution operations, followed by 6 interleaved unpooling and convolution operations. We start with 64 feature maps in the first convolution and double the number of feature maps in consecutive convolutions in the downsampling path. Thus, the number of feature maps in the network’s downsampling path is: 64, 128, 256, 512, 1024 and 2048. In the upsampling path, we progressively reduce the number of feature maps down to the number of classes. Thus, the number of feature maps in consecutive layers of the upsampling path is the following: 1024, 512, 256, 128, 64 and 11 (number of classes). We concatenate the output of the 4th pooling operation in the downsampling path of the DAE with the feature maps corresponding to the 4th pooling operation in the downsampling path of the segmentation network.
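The layer widths described above can be written down compactly. The snippet only reproduces the channel counts stated in the text; the variable names are illustrative, and kernel sizes and other layer details are not specified here.

```python
num_classes = 11

# Downsampling path: six convolutions, starting at 64 maps and doubling each time.
down_channels = [64 * 2**i for i in range(6)]

# Upsampling path: mirror the widths back down, ending with one map per class.
up_channels = down_channels[-2::-1] + [num_classes]

print(down_channels)  # [64, 128, 256, 512, 1024, 2048]
print(up_channels)    # [1024, 512, 256, 128, 64, 11]
```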

4.4 Training and inference details

We train our DAE by means of stochastic gradient descent with RMSprop (rmsprop, ), initializing the learning rate to and applying an exponential decay of after each epoch. All models are trained with data augmentation, randomly applying crops of size and horizontal flips. We regularize our model with a weight decay of . We use a minibatch size of 10. While training, we add zero-mean Gaussian noise ( or ) to the DAE input. We train the models for a maximum of 500 epochs and monitor the validation reconstruction error to early stop the training using a patience of 100 epochs.

At test time, we need to determine the step size $\epsilon$ and the number of iterations to get the final segmentation output. We select $\epsilon$ and the number of iterations by evaluating the pipeline on the validation set. To this end, we try several values of $\epsilon$ for a maximum of 50 iterations. For each iteration, we compute the mean intersection over union (mean IoU) on the validation set and keep the combination ($\epsilon$, number of iterations) that maximizes this metric to evaluate the test set.

The code to reproduce all experiments can be found here: The code requires the framework in (Visin_dataset_loaders, ) to load and preprocess the data.
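The selection procedure can be sketched as follows. `mean_iou`, `select_inference_params` and the callable `refine_step` are hypothetical names introduced for this example, the candidate step sizes passed in by the caller are placeholders rather than the values used in the experiments, and the metric is computed on a single label map for brevity instead of the whole validation set.

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Mean intersection over union between two integer label maps."""
    ious = []
    for c in range(num_classes):
        p, t = pred == c, target == c
        union = np.logical_or(p, t).sum()
        if union > 0:                      # skip classes absent from both maps
            ious.append(np.logical_and(p, t).sum() / union)
    return float(np.mean(ious))

def select_inference_params(y0, target, refine_step, step_sizes, max_iters, num_classes):
    """Return the (step size, iteration count, mean IoU) with the highest mean IoU."""
    best = (None, 0, -1.0)
    for eps in step_sizes:
        y = np.array(y0, dtype=float)      # restart from the feedforward proposal
        for it in range(1, max_iters + 1):
            y = refine_step(y, eps)        # one iterative inference update
            score = mean_iou(y.argmax(axis=0), target, num_classes)
            if score > best[2]:
                best = (eps, it, score)
    return best
```

Here `refine_step(y, eps)` is expected to apply one update of the iterative inference rule to class scores of shape (classes, height, width). Ties are resolved in favour of the first combination found, which is a choice made for this sketch.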

4.5 Results













Mean IoU

Gl. accuracy

FCN8 (Long2015fully, )
FCN8 + CRF 90.1 36.1
FCN8 + con. mod. (YuKoltun2016, ) 90.1
FCN8 + CRF-RNN (CRFasRNN, ) 22.3 30.1
FCN8 + DAE($t$)
FCN8 + DAE($y$) 80.0 92.1 75.3 72.6 80.3 46.2 42.5 60.0 89.3
FC-DenseNet (Jegou17, ) 94.4
FC-DenseNet + CRF 93.2 83.8 77.9 46.3 38.3 77.4 51.7 91.7
FC-DenseNet + con. mod. (YuKoltun2016, ) 94.4 77.4
FC-DenseNet + DAE($t$) 94.4
FC-DenseNet + DAE($y$) 38.8 94.4 82.5 60.3 67.4 91.7
Table 1: Results on CamVid dataset test set, using different segmentation networks. DAE($t$), trained on the corrupted ground truth, corresponds to training scenario 2 and DAE($y$), trained on the corrupted feedforward prediction, corresponds to training scenario 1 from Figure 1.

Table 1 reports our results for FCN-8 and FC-DenseNet103 without any post-processing step, with a fully connected CRF (Koltun11, ), with the context module (YuKoltun2016, ) as a trained post-processing step, with CRF-RNN (CRFasRNN, ) trained end-to-end with the segmentation network, and with the DAE’s iterative inference. For CRF, we use the publicly available implementation of (Koltun11, ).

(a) Legend
(b) Image
(c) Ground Truth
(d) FCN-8
(e) FCN-8 + CRF
(f) FCN-8 + DAE
(g) Image
(h) Ground Truth
(i) FC-DenseNet
(j) FC-DenseNet + CRF
(k) FC-DenseNet + DAE
Figure 3: Qualitative results. Main differences are marked with white boxes.

As shown in the table, using the DAE’s iterative inference on the segmentation candidates of a feedforward segmentation network (DAE($y$), training scenario 1) outperforms state-of-the-art post-processing variants, improving upon FCN-8 by a margin of IoU. When applying CRF as a post-processor, the FCN-8 segmentation results improve . Note that similar improvements for CRF were reported on other architectures for the same dataset (e.g., (SegNet2015, )). Similar improvements are achieved when using the context module (YuKoltun2016, ) as a post-processor () and when applying CRF-RNN (). It is worth noting that our method does not decrease the performance of any class with respect to FCN-8. However, CRF loses when segmenting column poles, whereas CRF-RNN loses when segmenting signs. When it comes to more recent state-of-the-art architectures such as FC-DenseNet103, the post-processing increment on the segmentation metrics is lower, as expected. Nevertheless, the improvement is still perceivable (+ in IoU). When comparing our method to other state-of-the-art post-processors, we observe a slight improvement. End-to-end training of CRF-RNN with FC-DenseNet103 did not yield any improvement over FC-DenseNet103.

It is worth comparing the performance of the proposed approach DAE($y$), trained from the corrupted feedforward prediction, with DAE($t$), trained from the ground truth. As shown in the table, DAE($y$) consistently outperforms DAE($t$). For FCN-8, the proposed method outperforms DAE($t$) by a margin of . For FC-DenseNet103, differences are smaller but still noticeable. In both cases, DAE($y$) not only outperforms DAE($t$) globally, but also in all classes that exhibit an improvement. Note that the model trained on the ground truth requires a larger Gaussian noise in order to slightly increase the performance of the pre-trained feedforward segmentation networks. It is worth mentioning that training our model end-to-end with the segmentation network did not improve the results, while being more memory demanding.

Figure 3 shows some qualitative segmentation results that compare the output of the feedforward network to both the CRF and iterative inference outputs. Figures 3(b)-3(f) show an example from the FCN-8 case, whereas Figures 3(g)-3(k) show an example from FC-DenseNet103. As shown in Figure 3(d), the FCN-8 segmentation network fails to properly find the fence in the image, mistakenly classifying it as part of a building (highlighted with a white box on the image). CRF is able to clean the segmentation candidate, for example, by filling in missing parts of the sidewalk, but is not able to add non-existing structure (see Figure 3(e)). Our method not only improves the segmentation candidate by smoothing large regions such as the sidewalk, but also corrects the prediction by incorporating missing objects such as the fence in Figure 3(f). As depicted in Figures 3(g)-3(k), in the case of FC-DenseNet the improvement in segmentation quality is minor and difficult to perceive by visual inspection. The qualitative results follow the findings from the quantitative analysis: CRF slightly decreases the quality of column pole segmentations (e.g., see the area inside the white boxes when comparing Figures 3(j) and 3(k)).

4.6 Analysis of iterative inference steps

In this subsection, we analyze the influence of the two inference parameters of our method, namely the step size $\epsilon$ and the number of iterations. This analysis is performed on the validation set of the CamVid dataset, for the above-mentioned feedforward segmentation networks. For the sake of comparison, we perform a similar analysis on the densely connected CRF, fixing the best configuration and only changing the number of CRF iterations.

Figure 4 shows how the performance varies with the number of iterations. Figure 4(a) and Figure 4(b) plot the results in the case of FCN-8 and FC-DenseNet103, respectively. As expected, there is a trade-off between the selected step size and the number of iterations. The smaller the step size $\epsilon$, the more iterations are required to achieve the best performance. Interestingly, all values of $\epsilon$ within a reasonable range lead to similar maximum performances.

(a) FCN-8
(b) FC-DenseNet103
Figure 4: Results at inference time. X-axis denotes the number of iterations and Y-axis is the Intersection over Union (IoU). The plots were obtained on CamVid’s validation set.
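The trade-off between step size and iteration count can be illustrated with a toy model. Assuming, purely for illustration, that the DAE reconstructs a fixed target regardless of its input, each update with step size $\epsilon$ shrinks the distance to that target by a factor $(1 - \epsilon)$, so smaller steps need more iterations. `iterations_to_converge` is a hypothetical helper, and a trained DAE is not a constant map, so the counts below are not predictions for the curves in Figure 4.

```python
import math

def iterations_to_converge(step_size, tolerance=1e-2):
    """Smallest k with (1 - step_size)**k <= tolerance, for 0 < step_size < 1."""
    return math.ceil(math.log(tolerance) / math.log(1.0 - step_size))

for eps in (0.5, 0.1, 0.05, 0.01):
    print(eps, iterations_to_converge(eps))
```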

5 Conclusions

We have proposed to use a novel form of denoising autoencoders for iterative inference in structured output tasks such as image segmentation. The autoencoder is trained to map corrupted predictions to target outputs and iterative inference interprets the difference between the output and the input as a direction of improved output configuration, given the input image.

The experiments provide positive evidence for the three questions raised at the beginning of Sec. 4: (1) a conditional DAE can be used successfully as the building block of iterative inference for image segmentation, (2) the proposed corruption model (based on the feedforward prediction) works better than the prescribed target output corruption, and (3) the resulting segmentation system outperforms state-of-the-art methods for obtaining coherent outputs.


The authors would like to thank the developers of Theano (Theano-2016short, ), Lasagne (lasagne, ) and the dataset loader framework (Visin_dataset_loaders, ). We acknowledge the support of the following agencies for research funding and computing support: Imagia, CIFAR, Canada Research Chairs, Compute Canada and Calcul Québec, as well as NVIDIA for the generous GPU support. Special thanks to Laurent Dinh for useful discussions and support.