Based on response timing and propagation delays in the brain, a plausible hypothesis is that the visual cortex can perform fast feedforward inference (ThorpeFizeMarlot96, ) when an answer is needed quickly and the image interpretation is easy enough (requiring as little as 200ms of cortical propagation for some object recognition tasks, i.e., just enough time for a single feedforward pass), but needs more time and iterative inference in the case of more complex inputs (Vanmarcke-et-al-2016, ). Recent deep learning research and the success of residual networks (He2016DeepRL, ; GreffSS16, ) point towards a similar scenario (Liao2016BridgingTG, ): early computation is feedforward, a series of non-linear transformations that map low-level features to high-level ones, while later computation is iterative (using lateral and feedback connections in the brain) in order to capture complex dependencies between different elements of the interpretation. Indeed, whereas a purely feedforward network could model a unimodal posterior distribution (e.g., the expected target with some uncertainty around it), the joint conditional distribution of output variables given inputs could be more complex and multimodal. Iterative inference could then be used to either sample from this joint distribution or converge towards a dominant mode of that distribution, whereas a unimodal-output feedforward network would converge to some statistic like the conditional expectation, which might not correspond to a coherent configuration of the output variables when the actual conditional distribution is multimodal.
This paper proposes such an approach, combining a first feedforward phase with a second iterative phase that searches for a dominant mode of the conditional distribution, and applies it to the problem of semantic image segmentation. We take advantage of theoretical results (Alain+Bengio-ICLR2013, ) on denoising autoencoders (DAEs), which show that they can estimate the score, i.e., the negative gradient of the energy function of the joint distribution of observed variables: the difference between the reconstruction and the input points in the direction of that estimated gradient. We propose to condition the autoencoder with an additional input so as to obtain the conditional score, i.e., the gradient of the energy of the conditional density of the output variables, given the input variables. The autoencoder takes a candidate output $y$ as well as an input $x$ and outputs a value $\hat{y}$ so that $\hat{y} - y$ estimates the direction $\frac{\partial \log p(y \mid x)}{\partial y}$. We can then take a gradient step in that direction, update $y$ towards a lower-energy value, and iterate in order to approach a mode of the implicit $p(y \mid x)$ captured by the autoencoder. We find that instead of corrupting the segmentation target as input of the DAE, we obtain better results by training the DAE with the corrupted feedforward prediction, which is closer to what will be seen as the initial state of the iterative inference process. The use of a denoising autoencoder framework to estimate the gradient of the energy is an alternative to more traditional graphical modeling approaches, e.g., conditional random fields (CRFs) (Lafferty01CRF, ; He04MCR, ), which have been used to model the joint distribution of pixel labels given an image (Koltun11, ).
The potential advantage of the DAE approach is that it is not necessary to decide on an explicitly parametrized energy function: such energy functions tend to only capture local interactions between neighboring pixels, whereas a convolutional DAE can potentially capture dependencies of any order and across the whole image, taking advantage of the state-of-the-art in deep convolutional architectures in order to model these dependencies via the direct estimation of the energy function gradient. Note that this is different from the use of convolutional networks for the feedforward part of the network, and regards the modeling of the conditional joint distribution of output pixel labels, given image features.
The main contributions of this paper are the following:
-  A novel training framework for modeling structured output conditional distributions which is an alternative to CRFs, based on denoising autoencoder estimation of energy gradients.
-  Showing how this framework can be used in an architecture for image pixelwise segmentation in which the above energy gradient estimator is used to propose a highly probable segmentation through gradient descent in the output space.
-  Demonstrating that this approach to image segmentation outperforms or matches classical alternatives such as combining convolutional nets with CRFs and more recent state-of-the-art alternatives on the CamVid dataset.
In this section, we describe the proposed iterative inference method to refine the segmentation of a feedforward network.
As pointed out in Section 1, a DAE can estimate a density via an estimator of the score, i.e., the negative gradient of the energy function (Vincent2010SDA, ; Vincent-NC-2011, ; Alain+Bengio-ICLR2013, ). These theoretical analyses of DAEs are presented for the particular case where the corruption noise added to the input is Gaussian. The results show that a DAE can estimate the gradient of the energy function of a joint distribution of observed variables. The main result is the following:

$$\frac{\partial \log p(x)}{\partial x} \approx \frac{1}{\sigma^2}\left(r(x) - x\right),$$

where $\sigma^2$ is the amount of Gaussian noise injected during training, $x$ is the input of the autoencoder and $r(x)$ is its output (the reconstruction). The approximation becomes exact as $\sigma \to 0$ and the autoencoder is given enough capacity, training examples and training time. The direction of $(r(x) - x)$ points towards more likely configurations of $x$. Therefore, the DAE learns a vector field pointing towards the manifold where the input data lies.
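As a small numerical illustration of this result, the sketch below uses a one-dimensional Gaussian data distribution, for which the optimal squared-error denoiser is known in closed form (the posterior mean). This is an illustrative assumption and not the trained convolutional DAE used in this paper; the function names and the toy parameters `mu` and `s2` are hypothetical. The scaled difference between reconstruction and input approaches the true score as the noise level shrinks.

```python
import numpy as np

def optimal_gaussian_denoiser(x_noisy, mu, s2, sigma2):
    """Posterior mean E[x | x_noisy] for x ~ N(mu, s2) and noise ~ N(0, sigma2)."""
    return (s2 * x_noisy + sigma2 * mu) / (s2 + sigma2)

def true_score(x, mu, s2):
    """Exact d/dx log N(x; mu, s2)."""
    return (mu - x) / s2

def score_gap(sigma2, mu=2.0, s2=1.5):
    """Largest gap between (r(x) - x) / sigma^2 and the true score on a grid."""
    x = np.linspace(-3.0, 6.0, 7)
    r = optimal_gaussian_denoiser(x, mu, s2, sigma2)
    estimate = (r - x) / sigma2
    return float(np.max(np.abs(estimate - true_score(x, mu, s2))))

# The estimate gets closer to the true score as the corruption noise decreases.
for sigma2 in (1.0, 0.1, 0.001):
    print(f"sigma^2 = {sigma2}: max |estimate - score| = {score_gap(sigma2):.5f}")
```

In this toy case the estimate equals $(\mu - x)/(s^2 + \sigma^2)$, which makes the role of the limit $\sigma \to 0$ explicit.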
2.2 Our framework
In our case, we seek to rapidly learn a vector field pointing towards more probable configurations of $y \mid x$. We propose to extend the results summarized in subsection 2.1 and condition the autoencoder with an additional input. If we condition the autoencoder with features $h$, which are a function of $x$, the DAE framework with Gaussian corruption learns to estimate $\frac{\partial \log p(y \mid x)}{\partial y} = -\frac{\partial E(y \mid x)}{\partial y}$, where $y$ is a segmentation candidate, $x$ an input image and $E$ is an energy function. Gradient descent in energy can thus be performed in order to iteratively reach a mode of the estimated conditional distribution:

$$y \leftarrow y - \epsilon \frac{\partial E(y \mid x)}{\partial y},$$

with step size $\epsilon$. In addition, whereas Gaussian noise around the target $t$ would be the DAE prescription for the corrupted input to be mapped to $t$, this may be inefficient at visiting the configurations we really care about, i.e., those produced by our feedforward predictor, which we use to obtain a first guess for $y$, as initialization of the iterative inference towards an energy minimum. Therefore, we propose that during training, instead of using a corrupted $t$ as input, the DAE takes as input a corrupted segmentation candidate $\tilde{y}$ and either the input $x$ or some features $h$ extracted from a feedforward segmentation network applied to $x$:

$$h = f_l(x),$$

where $f_l$ is a non-linear function and $l$ is the index of a layer in the feedforward segmentation network. The output of the DAE is computed as

$$\hat{y} = r(\tilde{y}, h),$$

where $r$ is a non-linear function which is trained to denoise conditionally and $\tilde{y}$ is a corrupted form of $y$. During training, $\tilde{y}$ is $y$ plus noise, while at test time (for inference) it is simply $y$ itself.
In order to train the DAE, (1) we extract both $y$ and $h$ from a feedforward segmentation network; (2) we corrupt $y$ into $\tilde{y}$; and (3) we train the DAE by minimizing the following loss

$$\mathcal{L}\left(r(\tilde{y}, h),\, t\right),$$

where $\mathcal{L}$ is the categorical cross-entropy and $t$ is the segmentation ground truth.
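The three training steps above can be sketched in NumPy as a single forward computation of the loss. This is only a minimal sketch under stated assumptions: `toy_dae` is a hypothetical per-pixel linear map standing in for the convolutional DAE, the tensors are random placeholders for the outputs of a real segmentation network, and no parameter update is shown.

```python
import numpy as np

def softmax(logits, axis=-1):
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def corrupt(y, sigma, rng):
    """Step (2): add zero-mean Gaussian noise to the segmentation candidate."""
    return y + rng.normal(0.0, sigma, size=y.shape)

def categorical_cross_entropy(probs, labels, eps=1e-12):
    """Mean pixelwise cross-entropy; probs is (H, W, C), labels is (H, W)."""
    h, w, _ = probs.shape
    picked = probs[np.arange(h)[:, None], np.arange(w)[None, :], labels]
    return float(-np.log(picked + eps).mean())

def toy_dae(y_tilde, h, weights):
    """Hypothetical stand-in for the DAE: per-pixel linear map, then softmax."""
    features = np.concatenate([y_tilde, h], axis=-1)
    return softmax(features @ weights)

rng = np.random.default_rng(0)
H, W, C, F = 4, 5, 3, 2                        # toy sizes (assumed)
y = softmax(rng.normal(size=(H, W, C)))        # step (1): segmentation proposal
h = rng.normal(size=(H, W, F))                 # step (1): intermediate features
t = rng.integers(0, C, size=(H, W))            # ground-truth labels
weights = rng.normal(scale=0.1, size=(C + F, C))

y_tilde = corrupt(y, sigma=0.1, rng=rng)       # step (2)
loss = categorical_cross_entropy(toy_dae(y_tilde, h, weights), t)  # step (3)
print(f"cross-entropy loss: {loss:.4f}")
```

In an actual training loop, this loss would be minimized with respect to the DAE parameters by a gradient-based optimizer.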
Figure 1(a) depicts the pipeline during training. First, a fully convolutional feedforward network for segmentation is trained. In practice, we use one of the state-of-the-art pre-trained networks. Second, given an input image $x$, we extract a segmentation proposal $y$ and intermediate features $h$ from the segmentation network. Both $y$ and $h$ are fed to a DAE network (adding Gaussian noise to $y$). The DAE is trained to properly reconstruct the clean segmentation (ground truth $t$). Figure 1(b) presents the original DAE prescription, where the DAE is trained by taking as input $t$ and $h$.
Once trained, we can exploit the trained model to iteratively take gradient steps in the direction of the segmentation manifold. To do so, we first obtain a segmentation proposal $y$ from the feedforward network and then we iteratively refine this proposal by applying the following rule

$$y \leftarrow y + \epsilon\left(r(y, h) - y\right).$$

For practical reasons, we collapsed the corruption noise $\sigma^2$ into the step size $\epsilon$.
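The refinement rule can be written as a short loop. The sketch below is illustrative only: `toy_denoiser` is a hypothetical function whose fixed point is encoded directly in `h`, standing in for the trained conditional DAE, and the step sizes are arbitrary example values.

```python
import numpy as np

def iterative_inference(y_init, h, denoiser, step_size, n_iters):
    """Repeatedly move y along (denoiser(y, h) - y), scaled by the step size."""
    y = np.array(y_init, dtype=float)
    for _ in range(n_iters):
        y = y + step_size * (denoiser(y, h) - y)
    return y

def toy_denoiser(y, h):
    """Hypothetical r(y, h): pulls y halfway towards a target stored in h."""
    return 0.5 * (y + h)

y0 = np.zeros((2, 2))      # initial proposal
h = np.ones((2, 2))        # the toy denoiser's fixed point

for eps in (0.1, 0.5):
    y = iterative_inference(y0, h, toy_denoiser, step_size=eps, n_iters=50)
    print(f"step size {eps}: distance to fixed point = {np.abs(y - h).max():.2e}")
```

With this toy denoiser the distance to the fixed point shrinks by a factor of $(1 - \epsilon/2)$ per iteration, so a smaller step size needs more iterations to reach the same accuracy.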
Figure 2 depicts the test pipeline. We start with an input image $x$ that we feed to a pre-trained segmentation network. The segmentation network outputs some intermediate feature maps $h$ and a segmentation proposal $y$. Then, both $y$ and $h$ are fed to the DAE to compute the output $\hat{y}$. The DAE is used to take iterative gradient steps towards the manifold of segmentation masks, with no noise added at inference time.
3 Related Work
On one hand, recent advances in semantic segmentation mainly focus on improving the architecture design (ronneberger2015u, ; SegNet2015, ; DrozdzalVCKP16, ; Jegou17, ), increasing the context understanding capabilities (Gatta14-deepvision, ; VisinKCBMC15, ; chen14semantic, ; YuKoltun2016, ) and building processing modules to enforce structure consistency on segmentation outputs (Koltun11, ; chen14semantic, ; CRFasRNN, ). Here, we are interested in this last research direction. CRFs are among the most popular choices to impose structured information at the output of a segmentation network, with fully connected CRFs (Koltun11, ) and CRFs as RNNs (CRFasRNN, ) among the best performing variants. More recently, an alternative that promotes structure consistency by decomposing the prediction process into multiple steps and iteratively adding structure information was introduced in (li2016iterative, ). Another iterative approach was introduced in (GidarisK16a, ) to tackle image semantic segmentation by repeatedly detecting, replacing and refining segmentation masks. Finally, the reinterpretation of residual networks (Liao2016BridgingTG, ; GreffSS16, ) was exploited in (DrozdzalCVDTRBP17, ), in the context of biomedical image segmentation, by iteratively refining learned pre-normalized images to generate segmentation predictions.
On the other hand, there has recently been some research devoted to exploiting results on DAEs for different tasks, such as image generation (NguyenYBDC16, ), high resolution image estimation (Sonderby2017, ) and semantic segmentation (Xie2016, ). In (NguyenYBDC16, ), the authors propose plug & play generative networks, which, in the best reported results, train a fully-connected DAE to reconstruct a denoised version of some feature maps extracted from an image classification network. The iterative update rule at inference time is performed in the feature space. In (Sonderby2017, ), the authors use a DAE in the context of image super-resolution to learn the gradient of the density of high resolution images and apply it to refine the output of an upsampled low resolution image. In (Xie2016, ), the authors exploit convolutional pseudo-priors trained on the ground-truth labels in a semantic segmentation task. During the training phase, the pseudo-prior is combined with the segmentation proposal from a segmentation model to produce a joint distribution over data and labels. At test time, the ground truth is not accessible, thus the FCN predictions are fed iteratively to the convolutional pseudo-prior network. In this work, we exploit DAEs in the context of image segmentation and extend them in two ways: first by using them to learn a conditional score, and second by using a corrupted feedforward prediction as input during training to obtain better segmentation results.
The main objective of these experiments is to answer the following questions:
-  Can a conditional DAE be used successfully as the building block of iterative inference for image segmentation?
-  Does our proposed corruption model (based on the feedforward prediction) work better than the prescribed target output corruption?
-  Does the resulting segmentation system outperform more classical iterative approaches to segmentation such as CRFs?
4.1 CamVid Dataset
CamVid is a fully annotated urban scene understanding dataset. It contains videos that are fully segmented. We used the same split and image resolution as (SegNet2015, ). The split contains 367 images (video frames) for training, 101 for validation and 233 for test. Each frame has a resolution of 360×480 and pixels are labeled with 11 different semantic classes.
4.2 Feedforward segmentation architecture
We experimented with two feedforward architectures for segmentation: the classical fully convolutional network FCN-8 of (Long2015fully, ) and the more recent state-of-the-art fully convolutional DenseNet (FC-DenseNet103) of (Jegou17, ), neither of which makes use of any additional synthetic data to boost its performance.
FCN-8 (Long2015fully, ): FCN-8 is a feedforward segmentation network, which consists of a convolutional downsampling path followed by a convolutional upsampling path. The downsampling path successively applies convolutional and pooling layers, and the upsampling path successively applies transposed convolutional layers. The upsampling path recovers spatial information by merging features skipped from the various resolution levels on the contracting path.
FC-DenseNet103 (Jegou17, ): FC-DenseNet is a state-of-the-art feedforward segmentation network that exploits the feature reuse idea of (DenseNet2016, ) and extends it to perform semantic segmentation. FC-DenseNet103 consists of a convolutional downsampling path followed by a convolutional upsampling path. The downsampling path iteratively concatenates all feature outputs in a feedforward fashion. The upsampling path applies a transposed convolution to feature maps from the previous stage and recovers information from higher resolution features from the downsampling path of the network by using skip connections.
4.3 DAE architecture
Our DAE is composed of a downsampling path and an upsampling path. The downsampling path contains convolutions and pooling operations, while the upsampling path is built from unpooling with switches (also known as unpooling with index tracking) (ZhaoMGL15, ; ZhangLL16, ; SegNet2015, ) and convolution operations. As discussed in (ZhangLL16, ), reverting the max pooling operations more faithfully significantly improves the quality of the reconstructed images. Moreover, while exploring potential network architectures, we found that using fully convolutional-like architectures with upsampling and skip connections (between downsampling and upsampling paths) degrades segmentation results when compared to unpooling with switches. This is not surprising, since we inject noise into the model's input when training the DAE. Skip connections directly propagate this added noise to the end layers, making them responsible for the data denoising process. Note that the last layers of the model might not have enough capacity to accomplish the denoising task.
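To make the idea of unpooling with switches concrete, the following NumPy sketch pools a single-channel map with non-overlapping windows while recording the position of each maximum, and then places each pooled value back at its recorded position. It is a simplified illustration (one channel, no batch dimension, hypothetical function names), not the layer implementation used in our experiments.

```python
import numpy as np

def max_pool_with_switches(x, k=2):
    """Non-overlapping k x k max pooling on an (H, W) map.

    Returns the pooled map and the switches (argmax index inside each window).
    """
    h, w = x.shape
    assert h % k == 0 and w % k == 0, "H and W must be divisible by k"
    # Rearrange into (H/k, W/k, k*k) so each window is a flat vector.
    windows = (x.reshape(h // k, k, w // k, k)
                .transpose(0, 2, 1, 3)
                .reshape(h // k, w // k, k * k))
    switches = windows.argmax(axis=-1)
    pooled = np.take_along_axis(windows, switches[..., None], axis=-1)[..., 0]
    return pooled, switches

def unpool_with_switches(pooled, switches, k=2):
    """Place each pooled value back at its recorded position; zeros elsewhere."""
    ph, pw = pooled.shape
    windows = np.zeros((ph, pw, k * k), dtype=pooled.dtype)
    np.put_along_axis(windows, switches[..., None], pooled[..., None], axis=-1)
    return (windows.reshape(ph, pw, k, k)
                   .transpose(0, 2, 1, 3)
                   .reshape(ph * k, pw * k))

x = np.array([[1.0, 2.0, 0.0, 0.0],
              [3.0, 4.0, 0.0, 5.0],
              [9.0, 0.0, 1.0, 1.0],
              [0.0, 0.0, 1.0, 7.0]])
pooled, switches = max_pool_with_switches(x)
restored = unpool_with_switches(pooled, switches)
print(pooled)
print(restored)
```

Because the maxima return to their original locations, spatial detail is preserved without passing the noisy input directly to the last layers, which is the property contrasted with skip connections above.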
In our experiments, we use a DAE built from 6 interleaved pooling and convolution operations, followed by 6 interleaved unpooling and convolution operations. We start with 64 feature maps in the first convolution and double the number of feature maps in consecutive convolutions in the downsampling path. Thus, the number of feature maps in the network's downsampling path is: 64, 128, 256, 512, 1024 and 2048. In the upsampling path, we progressively reduce the number of feature maps down to the number of classes. Thus, the number of feature maps in consecutive layers of the upsampling path is the following: 1024, 512, 256, 128, 64 and 11 (number of classes). We concatenate the output of the 4th pooling operation in the downsampling path of the DAE with the feature maps corresponding to the 4th pooling operation in the downsampling path of the segmentation network.
4.4 Training and inference details
after each epoch. All models are trained with data augmentation, randomly applying crops of sizeand horizontal flips. We regularize our model with a weight decay of . We use a minibatch size of 10. While training, we add zero-mean Gaussian noise ( or ) to the DAE input. We train the models for a maximum of 500 epochs and monitor the validation reconstruction error to early stop the training using a patience of 100 epochs.
At test time, we need to determine the step size $\epsilon$ and the number of iterations to get the final segmentation output. We select $\epsilon$ and the number of iterations by evaluating the pipeline on the validation set. To this end, we try several values of $\epsilon$ for a maximum of 50 iterations. For each iteration, we compute the mean intersection over union (mean IoU) on the validation set and keep the combination ($\epsilon$, number of iterations) that maximizes this metric to evaluate the test set. The code to reproduce all experiments can be found here: https://github.com/adri-romsor/iterative_inference_segm. The code requires the framework in (Visin_dataset_loaders, ) to load and preprocess the data.
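The selection procedure can be sketched as a small grid search. The sketch below makes several simplifying assumptions: it scores a single validation example instead of accumulating statistics over the whole validation set, the candidate step sizes are placeholder values, and `toy_denoiser` is a hypothetical stand-in for the trained DAE.

```python
import numpy as np

def mean_iou(pred, target, n_classes):
    """Mean intersection over union across classes present in pred or target."""
    ious = []
    for c in range(n_classes):
        p, t = pred == c, target == c
        union = np.logical_or(p, t).sum()
        if union == 0:
            continue  # class absent from both maps
        ious.append(np.logical_and(p, t).sum() / union)
    return float(np.mean(ious))

def select_inference_params(y_init, h, target, denoiser, step_sizes,
                            max_iters, n_classes):
    """Return (best mean IoU, step size, number of iterations)."""
    best = (-1.0, None, None)
    for eps in step_sizes:
        y = np.array(y_init, dtype=float)
        for it in range(1, max_iters + 1):
            y = y + eps * (denoiser(y, h) - y)
            score = mean_iou(y.argmax(axis=-1), target, n_classes)
            if score > best[0]:
                best = (score, eps, it)
    return best

# Toy setup: the hypothetical denoiser always returns the one-hot target,
# which is passed in through h; the initial proposal is wrong everywhere.
target = np.array([[0, 1], [1, 0]])
h = np.eye(2)[target]
y_init = np.eye(2)[1 - target]

def toy_denoiser(y, h):
    return h

best_score, best_eps, best_iters = select_inference_params(
    y_init, h, target, toy_denoiser,
    step_sizes=(0.1, 1.0), max_iters=50, n_classes=2)
print(best_score, best_eps, best_iters)
```

Ties are resolved in favor of the first combination that reaches the best score, so the order of the candidate step sizes matters in this sketch.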
|FCN8 + CRF||90.1||36.1|
|FCN8 + con. mod. YuKoltun2016||90.1|
|FCN8 + CRF-RNN (CRFasRNN, )||22.3||30.1|
|FCN8 + DAE()|
|FCN8 + DAE(||80.0||92.1||75.3||72.6||80.3||46.2||42.5||60.0||89.3|
|FC-DenseNet + CRF||93.2||83.8||77.9||46.3||38.3||77.4||51.7||91.7|
|FC-DenseNet + con. mod. YuKoltun2016||94.4||77.4|
|FC-DenseNet + DAE()||94.4|
|FC-DenseNet + DAE(||38.8||94.4||82.5||60.3||67.4||91.7|
Table 1 reports our results for FCN-8 and FC-DenseNet103 without any post-processing step, with a fully connected CRF (Koltun11, ), with the context network (YuKoltun2016, ) as a trained post-processing step, with CRF-RNN (CRFasRNN, ) trained end-to-end with the segmentation network, and with the DAE's iterative inference. For the CRF, we use the publicly available implementation of (Koltun11, ).
As shown in the table, using DAE's iterative inference on the segmentation candidates of a feedforward segmentation network (DAE($y$)) outperforms state-of-the-art post-processing variants, improving upon FCN-8 by a margin of IoU. When applying CRF as a post-processor, the FCN-8 segmentation results improve . Note that similar improvements for CRF were reported on other architectures for the same dataset (e.g., (SegNet2015, )). Similar improvements are achieved when using the context module (YuKoltun2016, ) as a post-processor () and when applying CRF-RNN (). It is worth noting that our method does not decrease the performance of any class with respect to FCN-8. However, CRF loses when segmenting column poles, whereas CRF-RNN loses when segmenting signs. When it comes to more recent state-of-the-art architectures such as FC-DenseNet103, the post-processing increment on the segmentation metrics is lower, as expected. Nevertheless, the improvement is still perceivable (+ in IoU). When comparing our method to other state-of-the-art post-processors, we observe a slight improvement. End-to-end training of CRF-RNN with FC-DenseNet103 did not yield any improvement over FC-DenseNet103.
It is worth comparing the performance of the proposed approach DAE($y$) with DAE($t$) trained from the ground truth. As shown in the table, DAE($y$) consistently outperforms DAE($t$). For FCN-8, the proposed method outperforms DAE($t$) by a margin of . For FC-DenseNet103, differences are smaller but still noticeable. In both cases, DAE($y$) not only outperforms DAE($t$) globally, but also in all classes that exhibit an improvement. Note that the model trained on the ground truth requires a larger Gaussian noise in order to slightly increase the performance of the pre-trained feedforward segmentation networks. It is worth mentioning that training our model end-to-end with the segmentation network did not improve the results, while being more memory demanding.
Figure 3 shows some qualitative segmentation results that compare the output of the feedforward network to both the CRF and iterative inference outputs. Figures 3(b)-3(f) show an example from the FCN-8 case, whereas Figures 3(g)-3(k) show an example from FC-DenseNet103. As shown in Figure 3(d), the FCN-8 segmentation network fails to properly find the fence in the image, mistakenly classifying it as part of a building (highlighted with a white box on the image). CRF is able to clean the segmentation candidate, for example, by filling in missing parts of the sidewalk, but is not able to add non-existing structure (see Figure 3(e)). Our method not only improves the segmentation candidate by smoothing large regions such as the sidewalk, but also corrects the prediction by incorporating missing objects such as the fence in Figure 3(f). As depicted in Figures 3(g)-3(k), in the case of FC-DenseNet the improvement in segmentation quality is minor and difficult to perceive by visual inspection. The qualitative results follow the findings from the quantitative analysis: CRF slightly decreases the quality of column pole segmentations (e.g., see the area inside the white boxes when comparing Figures 3(j) and 3(k)).
4.6 Analysis of iterative inference steps
In this subsection, we analyze the influence of the two inference parameters of our method, namely the step size $\epsilon$ and the number of iterations. This analysis is performed on the validation set of the CamVid dataset, for the above-mentioned feedforward segmentation networks. For the sake of comparison, we perform a similar analysis on the densely connected CRF, by fixing the best configuration and only changing the number of CRF iterations.
Figure 4 shows how the performance varies with the number of iterations. Figure 4(a) and Figure 4(b) plot the results in the case of FCN-8 and FC-DenseNet103, respectively. As expected, there is a trade-off between the selected step size and the number of iterations. The smaller the $\epsilon$, the more iterations are required to achieve the best performance. Interestingly, all $\epsilon$ within a reasonable range lead to similar maximum performances.
We have proposed to use a novel form of denoising autoencoders for iterative inference in structured output tasks such as image segmentation. The autoencoder is trained to map corrupted predictions to target outputs and iterative inference interprets the difference between the output and the input as a direction of improved output configuration, given the input image.
The experiments provide positive evidence for the three questions raised at the beginning of Sec. 4: (1) a conditional DAE can be used successfully as the building block of iterative inference for image segmentation, (2) the proposed corruption model (based on the feedforward prediction) works better than the prescribed target output corruption, and (3) the resulting segmentation system outperforms state-of-the-art methods for obtaining coherent outputs.
The authors would like to thank the developers of Theano (Theano-2016short, ), Lasagne (lasagne, ) and the dataset loader framework (Visin_dataset_loaders, ). We acknowledge the support of the following agencies for research funding and computing support: Imagia, CIFAR, Canada Research Chairs, Compute Canada and Calcul Québec, as well as NVIDIA for the generous GPU support. Special thanks to Laurent Dinh for useful discussions and support.
-  Guillaume Alain and Yoshua Bengio. What regularized auto-encoders learn from the data generating distribution. In International Conference on Learning Representations (ICLR’2013), 2013.
-  Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. CoRR, abs/1511.00561, 2015.
-  Gabriel J. Brostow, Jamie Shotton, Julien Fauqueur, and Roberto Cipolla. Segmentation and recognition using structure from motion point clouds. In European Conference on Computer Vision (ECCV), 2008.
-  Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. Semantic image segmentation with deep convolutional nets and fully connected crfs. 2015.
-  Michal Drozdzal, Gabriel Chartrand, Eugene Vorontsov, Lisa Di-Jorio, An Tang, Adriana Romero, Yoshua Bengio, Chris Pal, and Samuel Kadoury. Learning normalized inputs for iterative estimation in medical image segmentation. CoRR, abs/1702.05174, 2017.
-  Michal Drozdzal, Eugene Vorontsov, Gabriel Chartrand, Samuel Kadoury, and Chris Pal. The importance of skip connections in biomedical image segmentation. CoRR, abs/1608.04117, 2016.
-  F. Visin and A. Romero. Dataset loaders: a python library to load and preprocess datasets. https://github.com/fvisin/dataset_loaders, 2017.
-  Carlo Gatta, Adriana Romero, and Joost van de Weijer. Unrolling loopy top-down semantic feedback in convolutional deep networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) workshop, 2014.
-  Spyros Gidaris and Nikos Komodakis. Detect, replace, refine: Deep structured prediction for pixel wise labeling. CoRR, abs/1612.04770, 2016.
-  Klaus Greff, Rupesh Kumar Srivastava, and Jürgen Schmidhuber. Highway and residual networks learn unrolled iterative estimation. CoRR, abs/1612.07771, 2016.
-  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
-  Xuming He, Richard S. Zemel, and Miguel Á. Carreira-Perpiñán. Multiscale conditional random fields for image labeling. In Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR’04, pages 695–703, Washington, DC, USA, 2004. IEEE Computer Society.
-  Gao Huang, Zhuang Liu, Kilian Q. Weinberger, and Laurens van der Maaten. Densely connected convolutional networks. CoRR, abs/1608.06993, 2016.
-  Simon Jégou, Michal Drozdzal, David Vázquez, Adriana Romero, and Yoshua Bengio. The one hundred layers tiramisu: Fully convolutional densenets for semantic segmentation. In Workshop on Computer Vision in Vehicle Technology CVPRW, 2017.
-  Philipp Krähenbühl and Vladlen Koltun. Efficient inference in fully connected crfs with gaussian edge potentials. 2011.
-  John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning, ICML '01, pages 282–289, San Francisco, CA, USA, 2001. Morgan Kaufmann Publishers Inc.
-  Lasagne. Lasagne. https://github.com/Lasagne/Lasagne, 2016.
-  Ke Li, Bharath Hariharan, and Jitendra Malik. Iterative instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3659–3667, 2016.
-  Qianli Liao and Tomaso A. Poggio. Bridging the gaps between residual learning, recurrent neural networks and visual cortex. CoRR, abs/1604.03640, 2016.
-  Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. 2015.
-  Anh Nguyen, Jason Yosinski, Yoshua Bengio, Alexey Dosovitskiy, and Jeff Clune. Plug & play generative networks: Conditional iterative generation of images in latent space. CoRR, abs/1612.00005, 2016.
-  Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2015.
-  Casper Kaae Sønderby, Jose Caballero, Lucas Theis, Wenzhe Shi, and Ferenc Huszár. Amortised MAP inference for image super-resolution. International Conference on Learning Representations, 2017.
-  Theano Development Team. Theano: A Python framework for fast computation of mathematical expressions. arXiv e-prints, abs/1605.02688, May 2016.
-  S. Thorpe, D. Fize, and C. Marlot. Speed of processing in the human visual system. Nature, 381:520, 1996.
-  T. Tieleman and G. Hinton. rmsprop adaptive learning. In COURSERA: Neural Networks for Machine Learning, 2012.
-  Steven Vanmarcke, Filip Calders, and Johan Wagemans. The time-course of ultrarapid categorization: The influence of scene congruency and top-down processing. i-Perception, 2016.
-  Pascal Vincent. A connection between score matching and denoising autoencoders. Neural Computation, 23(7):1661–1674, July 2011.
-  Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, and Pierre-Antoine Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. J. Mach. Learn. Res., 11:3371–3408, December 2010.
-  Francesco Visin, Marco Ciccone, Adriana Romero, Kyle Kastner, Kyunghyun Cho, Yoshua Bengio, Matteo Matteucci, and Aaron Courville. Reseg: A recurrent neural network-based model for semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) workshop, 2016.
-  Saining Xie, Xun Huang, and Zhuowen Tu. Top-Down Learning for Structured Labeling with Convolutional Pseudoprior, pages 302–317. Springer International Publishing, Cham, 2016.
-  Fisher Yu and Vladlen Koltun. Multi-scale context aggregation by dilated convolutions. 2016.
-  Yuting Zhang, Kibok Lee, and Honglak Lee. Augmenting supervised neural networks with unsupervised objectives for large-scale image classification. In Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, pages 612–621, 2016.
-  Junbo Jake Zhao, Michaël Mathieu, Ross Goroshin, and Yann LeCun. Stacked what-where auto-encoders. CoRR, abs/1506.02351, 2015.
-  Shuai Zheng, Sadeep Jayasumana, Bernardino Romera-Paredes, Vibhav Vineet, Zhizhong Su, Dalong Du, Chang Huang, and Philip Torr. Conditional random fields as recurrent neural networks. In International Conference on Computer Vision (ICCV), 2015.