1 Introduction
Deep learning [1] has enjoyed remarkable success in a wide range of fields, including speech recognition [2], image processing [3, 4] and natural language processing [5]. One of the key factors in the rise of deep learning has been the availability of large-scale annotated/labeled datasets such as Imagenet [6]. While for some tasks, such as classification, the collection of large-scale ground-truth annotations is tolerable, in other tasks it is extremely tedious and expensive. For instance, it has been shown recently [7] that pixel-wise labelling for semantic segmentation on Pascal VOC 2012 [8] takes about 4 minutes per image.

Beyond standard human annotation, there exist two main data labelling approaches. The first, annotation using simulated data (see e.g. [9, 10, 11, 12, 13, 14]), requires some expertise in terms of the behavior of the signals under consideration; further, there is often a domain shift between real and simulated data. In the second approach, which applies to image-processing tasks such as denoising, deblurring, super-resolution and compression, one can create the inputs to the network as a simple degradation of the original outputs/labels. For example, this technique was used and extended in [3].

We propose a different approach, which circumvents the annotation issue entirely, by relying on an externally specified energy function. This energy is of a particular form: it is a function of both the data and the label, and when optimized over the desired label (holding the data fixed), it gives a prediction for the label given the data. Where would such an energy function come from? Fortunately, the computer vision and image processing literature is replete with examples of such functions, for solving e.g. depth from stereo [15, 16], super-resolution [17], single-image dehazing [18] and optical flow estimation [19]. These energies typically incorporate terms which ensure fidelity to the observed data, as well as some sort of signal prior. With this in mind, the main thrust of this paper is as follows: given such an energy function, we show how to formulate learning on an entirely unlabelled dataset. Thus, one can save immensely on the cost of annotation by dispensing with it entirely. Alternatively, the proposed scheme can be fused with supervised learning: the outcome of our method can serve as near-perfect automatic annotations, later to be manually adjusted in a much shorter time in order to create a labeled database for supervised learning. Moreover, if a small set of labeled data is available, it can be incorporated as well to boost the overall scheme.
In this context, a natural question might be: given the energy function, why bother learning at all – why not simply optimize the energy at runtime? There are two main answers to this question. The first is speed: once the network has been learned, the solution can be obtained by an almost instant forward-pass, which is typically orders of magnitude faster than explicit optimization. The second answer relates to regularization, from two points of view. Just as in [20], the mere choice of a specific CNN architecture to represent the prediction acts as a successful regularizer. Further, having the network fit its solution to a variety of images during training acts as an additional regularization.
The main contributions of this paper are as follows. First, we propose the Deep Energy approach for learning on unlabelled datasets (Section 3.1). Second, we show how to apply this formalism to two example problems, seeded segmentation (Section 3.2) and image matting (Section 3.3). Third, we learn CNNs for each of these problems on unlabelled datasets, and show performance which often exceeds the analytical solution to the corresponding explicit energy minimization (Section 4).
2 Related Work
Training CNNs via Energy Minimization Our approach is inspired by recent work on image style transfer. Given "content" and "style" images, the objective of [21] is to create a new image with the same content, in the style of a famous painting. The key to this method is the use of an energy function which includes two terms: the first is a data fidelity term that forces the new image to be close to the content image in terms of a perceptual loss; the second is a "prior" or "physical" term, which promotes closeness in terms of second-order statistics between the newly-created and style images. Instead of minimizing this energy function for each input image, the works reported in [22, 23] propose training a fully convolutional network to do so. During training, rather than being given pairs of regular and stylized images and training the network in a standard supervised fashion, the authors proceed in an unsupervised way, via direct minimization of the energy function described above. By the end of the training process, the network has learned to transform any input image into a stylized artistic image, via a fast and simple forward-pass operation.
Another line of work which is somewhat related to our approach is the joint training of neural networks and graphical models [24, 25]. The work reported in [24] tackles the task of single-image depth estimation, using a neural network that is trained to provide the unary and pairwise potentials for a continuous CRF (Conditional Random Field). As opposed to supervised training, which regresses a neural network over ground-truth depth maps, their training of the unary and pairwise sub-networks involves an explicit minimization of a log-likelihood expression, modeling the relations between neighboring pixels. At test time, the two sub-networks provide the potentials that are fed to a closed-form analytic solution of the objective function used during training. In [25], the authors propose a hybrid between CNNs and MRFs (Markov Random Fields) in order to solve the human pose estimation problem. Their solution consists of two networks: the first, a supervised feed-forward part-detector providing rough heatmap estimations; the second, a high-level spatial model enforcing global pose consistency, formulated via MRFs. While these methods concentrate on specific applications and energy functions, we provide a broad end-to-end formulation.

Seeded Segmentation Seeded segmentation has enjoyed a considerable amount of attention with the introduction of the popular graph-cuts [26, 27] and random-walker [28] algorithms. With the emergence of deep learning and fully annotated datasets such as Pascal VOC 2012 [8] and Imagenet [6], the focus has shifted to fully automatic semantic segmentation [29]. A recent attempt to alleviate the burden of per-pixel image annotation was proposed in [7], which suggests weak supervision in the form of user-provided points for the task of semantic segmentation. Their approach, however, does not originate from an energy minimization perspective, and the weak supervision is only applied during training; at test time, regular semantic segmentation is performed.
Image Matting There have been numerous proposals for tackling the image matting problem. Since this problem is ill-posed, additional regularization in the form of user interaction is required. Usually, the user must provide an additional "trimap": a rough segmentation of the image into foreground, background and unknown pixels. The unknown pixels lie on the border between the object and the background, and usually contain challenging features such as hair, fur, semi-transparent objects, etc. In order to successfully output an accurate image matte, some iterative, often non-linear, optimization is involved [30, 31, 32, 33, 34, 35]. Several deep-learning-based techniques for image matting [9, 36, 37] have been proposed over the past two years. The work in [36] uses a CNN to learn a trimap, which is later fed as input to an analytic solver. In [37], several analytic solvers are used to create output alpha mattes for the supervised training of a CNN. As opposed to these works, we circumvent the use of an analytic solver and directly train the network via an appropriate objective function. In [9], a large dataset for supervised image matting is created, using a composition of interesting matting objects over natural scene backgrounds; using this dataset, end-to-end image matting is performed. This method provides impressive results on simulated image-matting benchmarks; however, it may suffer from domain shift when applied to real-life images.
3 The Deep Energy Approach
3.1 The General Approach
Suppose that given data $x$, our goal is to compute the corresponding label $y$; it would be instructive to think of the label as another signal, rather than a discrete label as in classification. We assume the existence of an energy function $E(x, y)$ for which
$$\hat{y}(x) = \arg\min_{y} E(x, y). \qquad (1)$$
This formulation is quite standard in signal processing, image processing, and computer vision; it is the essence of an inverse problem approach [38, 39, 40].
Our idea is to learn a network which will effectively solve problem (1). In particular, consider a network $f_{\Theta}$ parameterized by parameters $\Theta$, such that the function $\hat{y}(x)$ is hypothesized to look like $f_{\Theta}(x)$, where the hat stands for "prediction". Thus, learning how to solve the inverse problem is equivalent to learning the parameters $\Theta$. Assuming that our unlabelled training data consists of $N$ examples of the datum $x$, namely $\{x_i\}_{i=1}^{N}$, in order to learn $\Theta$ we propose to solve:
$$\hat{\Theta} = \arg\min_{\Theta} \frac{1}{N} \sum_{i=1}^{N} E\big(x_i, f_{\Theta}(x_i)\big). \qquad (2)$$
We essentially learn the network so that, averaged over all samples, the energy function is minimized (with respect to the network parameters). This is the deep energy approach: we learn a prediction function without seeing any labelled data, via an externally specified energy function.
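To make this concrete, consider the following toy sketch (our own illustration, not the paper's code): the energy is a scalar fidelity-plus-prior term $E(x, y) = (y - x)^2 + \lambda y^2$, the "network" is a single-parameter linear map $f_\theta(x) = \theta x$, and training is plain gradient descent over unlabelled samples. Since the per-sample minimizer is $y^\ast = x / (1 + \lambda)$, the parameter $\theta$ should approach $1 / (1 + \lambda)$.

```python
import random

# Toy "deep energy" training: no labels, only an energy E(x, y).
# Energy: data fidelity (y - x)^2 plus a quadratic prior lam * y^2.
# Its per-sample minimizer is y* = x / (1 + lam), so a linear
# "network" f_theta(x) = theta * x should learn theta -> 1 / (1 + lam).

LAM = 1.0

def energy(x, y):
    return (y - x) ** 2 + LAM * y ** 2

def grad_theta(theta, x):
    # d/dtheta E(x, theta * x) = 2 * x^2 * ((1 + LAM) * theta - 1)
    return 2 * x * x * ((1 + LAM) * theta - 1.0)

random.seed(0)
data = [random.uniform(-1, 1) for _ in range(200)]  # unlabelled samples

theta, lr = 0.0, 0.5
for epoch in range(100):
    g = sum(grad_theta(theta, x) for x in data) / len(data)
    theta -= lr * g

print(theta)  # close to 1 / (1 + LAM) = 0.5
```

Even in this one-parameter case, the learned "network" recovers the analytic minimizer without ever seeing a label.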
The proposed scheme can serve any application that can be posed in terms of an informative energy function, and for which there is a difficulty in collecting large-scale data for supervised learning. Examples of such tasks are single-image dehazing [18], optical flow [41], single-image depth estimation [42], image retargeting [43], Retinex [44] and more. We now demonstrate two specific applications of Deep Energy: seeded segmentation and image matting.
3.2 Application: Seeded Segmentation
Background The first application we consider is seeded segmentation, in which the input consists of both an image and user-provided seeds, and the output is a segmentation of the image. These seeds, in the form of points or scribbles, provide the algorithm with an additional cue for computing the segmentation. Seeded segmentation has enjoyed a certain amount of popularity within the medical imaging community [45, 46], since it allows the user a degree of control over the final result. From our point of view, seeded segmentation is an interesting testbed for our Deep Energy approach, as there are standard energy functions for performing this task.
In particular, we use the energy function of the well-known Random Walker algorithm of Grady [28], which we now briefly detail. The image is represented as a weighted undirected graph $G = (V, E)$, where the vertices are the pixels, and edges connect each pixel to its 4-neighborhood. The weight of edge $(i, j)$ is given by $w_{ij} = \exp\left(-\beta (I_i - I_j)^2\right)$, where $I_i$ and $I_j$ are the grayscale values of the pixels and $\beta$ is a global scaling parameter.
The idea behind the random walker is the following: the probability that a random walk starting from a given pixel reaches the seeds of a given class in the image is also the final probability the pixel receives for that class. Formally, let $P^c \in [0, 1]^{n}$ be the probability of each pixel in the image to belong to the label of class $c$, where $n$ is the number of pixels in the image. We denote by $S^c$ the "seed image" for the $c$-th class, whose elements are $0$ or $1$, indicating absence or presence of a seed respectively; $p^c$ and $s^c$ are their respective vectorized versions, over $c = 1, \ldots, C$ classes. Then, one can solve for the random walk probabilities by minimizing the following energy:

$$\mathcal{L}_{seg} = \sum_{c=1}^{C} \left[ (p^c)^T L \, p^c + \lambda \, (p^c - s^c)^T D \, (p^c - s^c) \right] \qquad (3)$$

where $D$ is a diagonal matrix whose diagonal elements indicate the presence of a seed (of any class) at a given pixel, and $L$ is the Laplacian of the graph. The first term is a smoothness term, ensuring that the segmentation is spatially smooth, though discontinuities are penalized less where the underlying image is discontinuous. The second term encourages fidelity to the seeds.
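The following sketch (our own illustration) minimizes this energy on a tiny 1-D "image" with one foreground and one background seed. It clamps the seeds, which corresponds to the large-weight limit of the fidelity term, and iterates a Jacobi-style update on the smoothness term; the value of the scaling parameter is an arbitrary choice for the demo.

```python
import math

# Tiny 1-D "image": a dark region followed by a bright region.
I = [0.1, 0.1, 0.1, 0.9, 0.9, 0.9]
BETA = 50.0  # edge-weight scaling (arbitrary demo choice)

# Weight between neighboring pixels: large for similar intensities,
# nearly zero across the intensity step.
def w(i, j):
    return math.exp(-BETA * (I[i] - I[j]) ** 2)

# Hard seeds: pixel 0 is foreground (prob 1), pixel 5 background (prob 0).
# Clamping seeds is the large-lambda limit of the fidelity term.
seeds = {0: 1.0, 5: 0.0}

p = [seeds.get(i, 0.5) for i in range(len(I))]
for _ in range(500):  # Jacobi iteration on the smoothness term
    new_p = p[:]
    for i in range(len(I)):
        if i in seeds:
            continue
        nbrs = [j for j in (i - 1, i + 1) if 0 <= j < len(I)]
        total = sum(w(i, j) for j in nbrs)
        new_p[i] = sum(w(i, j) * p[j] for j in nbrs) / total
    p = new_p

print([round(v, 3) for v in p])  # ~[1, 1, 1, 0, 0, 0]
```

The weak edge across the intensity step effectively decouples the two segments, so the foreground probability spreads from each seed only within its own region.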
Tensorization One can directly use the energy expression in Equation (3) as a loss function during training. We now show how to convert this function to a form which is friendlier for the tensor processing which is common in deep learning packages. We rewrite Equation (3) as

$$\mathcal{L}_{seg} = \sum_{c=1}^{C} \left[ \frac{1}{2} \sum_{i} \sum_{j \in \mathcal{N}(i)} w_{ij} \left( p^c_i - p^c_j \right)^2 + \lambda \, (p^c - s^c)^T D \, (p^c - s^c) \right] \qquad (4)$$
using a common expansion of the Laplacian quadratic form. Now, concatenate $K$ identical copies of the output $P$ along the last dimension to create $\tilde{P}$, where $K$ is the number of neighbors ($K = 4$ in our case). Further, form $P_N$, as a concatenation of the $K$ "neighbor images" of the output $P$; in the case of 4-neighborhoods, the neighbor images are simply the image shifted left, right, up and down. Finally, the weights can be represented as a matrix of the same spatial size as the image; we then take $K$ copies of this matrix, and put them in the 3-tensor $W$. Then the energy function can be written as follows:

$$\mathcal{L}_{seg} = \sum_{b=1}^{B} \sum_{c=1}^{C} \left[ \frac{1}{2} \, \mathrm{sum}\!\left( W \odot \big( \tilde{P}^{b,c} - P_N^{b,c} \big)^{2} \right) + \lambda \, \mathrm{sum}\!\left( D \odot \big( P^{b,c} - S^{b,c} \big)^{2} \right) \right] \qquad (5)$$

where $\mathrm{sum}(\cdot)$ denotes summation over all tensor elements, $\odot$ is the elementwise product, we have summed over the batch dimension $b = 1, \ldots, B$, and the powers are taken elementwise.
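The expansion used above, equating the half-sum of weighted neighbor differences with the Laplacian quadratic form, can be verified numerically. The following sketch (our own, on a tiny grid with random symmetric 4-neighbor weights) checks the identity in plain Python:

```python
import random

random.seed(1)
H, W = 2, 3
N = H * W
# Random per-pixel values p and random symmetric 4-neighbor weights.
p = [random.random() for _ in range(N)]

def idx(r, c):
    return r * W + c

weights = {}  # symmetric weights on 4-neighbor edges (both directions)
for r in range(H):
    for c in range(W):
        for dr, dc in ((0, 1), (1, 0)):
            rr, cc = r + dr, c + dc
            if rr < H and cc < W:
                wt = random.random()
                weights[(idx(r, c), idx(rr, cc))] = wt
                weights[(idx(rr, cc), idx(r, c))] = wt

# "Tensorized" form: half the sum, over every pixel and each of its
# neighbors (i.e. over shifted copies of the image), of the weighted
# squared differences.
pairwise = 0.5 * sum(wt * (p[i] - p[j]) ** 2 for (i, j), wt in weights.items())

# Laplacian quadratic form p^T L p with L = D - W.
quad = 0.0
for i in range(N):
    deg = sum(wt for (a, _), wt in weights.items() if a == i)
    quad += deg * p[i] ** 2
for (i, j), wt in weights.items():
    quad -= wt * p[i] * p[j]

print(abs(pairwise - quad) < 1e-12)  # True: the two forms agree
```

The agreement holds for any symmetric weights, which is what licenses replacing the matrix-vector form with sums over shifted images.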
3.3 Application: Image Matting
Background A second energy function which illustrates the Deep Energy approach comes from the problem of image matting. In this task, one extracts an object from its background by determining the opacity and color of each pixel in the foreground. The input is an RGB image $I$, which is a composition of the foreground image, $F$, and the background image, $B$. Each pixel $i$ in the image can be described by the matting equation:

$$I_i = \alpha_i F_i + (1 - \alpha_i) B_i \qquad (6)$$

where $\alpha_i \in [0, 1]$ is the opacity of each pixel, also called the alpha matte. Note that recovering the alpha matte from $I$ is extremely ill-posed: one has to deduce seven quantities per pixel (the R, G and B values of both $F_i$ and $B_i$, and the value of $\alpha_i$) from only three observed values.
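A minimal illustration of the matting equation, with toy colors of our own choosing:

```python
# The matting equation per pixel: I = alpha * F + (1 - alpha) * B,
# applied channel-wise to RGB triples (toy values for illustration).
def composite(alpha, F, B):
    return tuple(alpha * f + (1 - alpha) * b for f, b in zip(F, B))

fg = (0.9, 0.2, 0.1)   # foreground color
bg = (0.1, 0.1, 0.8)   # background color

# Opaque, transparent, and semi-transparent pixels:
print(composite(1.0, fg, bg))   # exactly the foreground color
print(composite(0.0, fg, bg))   # exactly the background color
print(composite(0.5, fg, bg))   # a 50/50 blend of the two colors
```

Given only the composited triple on the left-hand side, any of the seven quantities on the right can trade off against the others, which is why additional regularization (the seeds) is needed.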
In [30], the authors propose a closed-form solution to the image matting problem, with the help of minimal user interaction, again in the form of seeds. They derive a cost function using local smoothness assumptions on the foreground and background images, and show that it is possible to entirely eliminate the dependence on these images, thus allowing a closed-form global solution for the alpha matte. In particular, one may minimize the following energy:
$$\mathcal{L}_{mat} = \alpha^T L_m \, \alpha + \lambda \, (\alpha - s)^T D \, (\alpha - s) \qquad (7)$$

where $D$ is a diagonal matrix whose diagonal elements indicate the presence of a seed – either foreground or background – at a given pixel, and $s$ is the seed image, whose elements are $1$ or $0$, indicating foreground or background respectively. Note that if a pixel is not a seed, the corresponding element of $s$ can take on any value, as it will be zeroed out by the multiplication by $D$. Finally, the matrix $L_m$ is a special Laplacian-like matrix, specific to the matting problem; as derived in [30], it is given by

$$L_m(i, j) = \sum_{k \,\mid\, (i, j) \in w_k} \left( \delta_{ij} - \frac{1}{|w_k|} \left( 1 + (I_i - \mu_k)^T \Big( \Sigma_k + \frac{\epsilon}{|w_k|} \, \mathrm{Id} \Big)^{-1} (I_j - \mu_k) \right) \right) \qquad (8)$$

where $w_k$ is a patch around pixel $k$; $\mu_k$ and $\Sigma_k$ are the mean and covariance of the patch; $\delta_{ij}$ is the Kronecker delta; $\epsilon$ is a small regularization constant; and $\mathrm{Id}$ is the identity matrix.
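As a sanity check on this construction, the sketch below (our own illustration) builds the matting Laplacian for a grayscale 1-D signal, where each patch covariance reduces to a scalar variance; the window size and the regularization constant are arbitrary demo choices. It then verifies the Laplacian property that every row sums to zero, so constant alpha mattes incur no smoothness cost:

```python
import random

random.seed(2)
N, WIN, EPS = 7, 3, 1e-5  # pixels, window size, regularizer (demo choices)
I = [random.random() for _ in range(N)]

# Grayscale 1-D specialization of the matting Laplacian: per window,
# the covariance matrix reduces to a scalar variance, and the
# (I_i - mu)^T (Sigma + eps/|w| Id)^{-1} (I_j - mu) term becomes a
# scalar product.
L = [[0.0] * N for _ in range(N)]
for k in range(N - WIN + 1):            # overlapping windows of WIN pixels
    win = list(range(k, k + WIN))
    mu = sum(I[i] for i in win) / WIN
    var = sum((I[i] - mu) ** 2 for i in win) / WIN
    for i in win:
        for j in win:
            delta = 1.0 if i == j else 0.0
            L[i][j] += delta - (1.0 / WIN) * (
                1.0 + (I[i] - mu) * (I[j] - mu) / (var + EPS / WIN)
            )

# Laplacian property: every row sums to zero (the per-window deviations
# I_j - mu sum to zero within each window).
row_sums = [sum(row) for row in L]
print(all(abs(s) < 1e-9 for s in row_sums))  # True
```

The same row-sum property holds in the full RGB case, since the per-window deviations from the mean always sum to zero.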
Tensorization We again rewrite the energy function in Equation (7) in a tensor-friendly format. Rephrasing the first term in (7) in terms of weights, we have that

$$\alpha^T L_m \, \alpha = \sum_{k} \sum_{(i, j) \in w_k \times w_k} w^{k}_{ij} \, \alpha_i \, \alpha_j \qquad (9)$$

where we sum over all overlapping patches $w_k$ around pixels in the resulting alpha matte, as well as over all possible combinations of pixel pairs in a given patch, where the total number of combinations is $|w_k|^2$. The weights, $w^k_{ij}$, are derived from the Laplacian and are given as:

$$w^{k}_{ij} = \delta_{ij} - \frac{1}{|w_k|} \left( 1 + (I_i - \mu_k)^T \Big( \Sigma_k + \frac{\epsilon}{|w_k|} \, \mathrm{Id} \Big)^{-1} (I_j - \mu_k) \right) \qquad (10)$$

We can then tensorize the matting term and add the tensorized data term to get the final loss function:

$$\mathcal{L}_{mat} = \sum_{b=1}^{B} \left[ \sum_{k} \sum_{m=1}^{|w_k|^2} W^{k}_{m} \, \tilde{\alpha}^{b}_{m,i} \, \tilde{\alpha}^{b}_{m,j} + \lambda \, \mathrm{sum}\!\left( D \odot \big( \alpha^{b} - s^{b} \big)^{2} \right) \right] \qquad (11)$$

where $m$ indexes the pixel pairs in a patch and $W$ is the matrix of weights. $\tilde{\alpha}_{\cdot,i}$ and $\tilde{\alpha}_{\cdot,j}$ are repetitions of the alpha matte; the first represents the alpha mattes in index $i$ and the second represents the alpha mattes in index $j$. The data term of the energy function is exactly the same as that in seeded segmentation in Equation (5), apart from the minor difference that there is no need for summation over the classes, since the resulting alpha matte is already $1$-dimensional; here, $S^{F}$ and $S^{B}$ are simply the foreground and background seed images.
3.4 Architecture
Our network structure, shown in Figure 1, is fully convolutional and based on the Context Aggregation Network (CAN) introduced in [47]. Aggregation is performed by dilated convolutions with gradually increasing rates, which increase the receptive field and spread out the sparse seed data. We incorporate Resnet-style [48] skip connections to allow for additional gradient flow through the network layers.
More specifically, the intermediate layers consist of a cascade of dilated residual blocks. Each such block contains several consecutive regular conv layers, followed by a single dilated-conv layer whose dilation factor grows with the block number. We add skip connections between each block's input and output, as shown in Figure 1. Similar to Resnet [48], we connect the input and output by a simple addition, keeping the width of the output intact. Apart from the final layer, all convolution layers have filters of the same fixed output width. All layers, apart from the output layer and each block's last dilated-conv layer, are followed by batch-norm and a ReLU non-linearity; the dilated-conv layers in each block are followed by batch-norm only. The final conv layer is a simple linear transformation, performed with a $1 \times 1$ convolution whose width equals the size of the desired output. In seeded segmentation, there is an additional softmax layer that converts the output scores into probabilities.
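The receptive-field growth of such a dilated cascade is easy to compute. The sketch below assumes 3×3 kernels and dilation rates that double from block to block (a CAN-style choice we adopt for illustration; the text does not pin down the exact rates):

```python
# Receptive field of a stride-1 cascade of dilated convolutions:
# a layer with kernel size k and dilation d enlarges the receptive
# field by (k - 1) * d.
def receptive_field(dilations, kernel_size=3):
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d
    return rf

# Doubling rates per block (2^b): an assumption made for this demo.
for blocks in (3, 5, 7):
    rates = [2 ** b for b in range(blocks)]
    print(blocks, rates, receptive_field(rates))
```

Under these assumptions the receptive field grows exponentially with the number of blocks (15, 63 and 255 pixels here), while the parameter count grows only linearly, which is what lets the network spread sparse seed information across the whole image.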
4 Results
We now present the results of our deep energy loss on the seeded segmentation and image matting tasks. The results showcase the ability of deep energy learning to mimic the behavior of random-walker segmentation [28] and closed-form image matting [30].
Dataset We use the annotated training images in the Pascal VOC dataset [8] to create training and test images for both tasks discussed. It is critical to note that although these images are fully annotated with pixel-wise segmentation masks, we never make use of this information; rather, we simply use the masks to generate very simple seeds, which are inputs to both seeded segmentation and image matting, as previously described. We now describe this process of seed generation.
We perform binary segmentation on each Pascal VOC image, in which we isolate a single object from the rest of the image. If there exist several objects in the original image, we create several copies of the same image, with different seeds in each distinct object. By looking at only one object at a time and eliminating images with very small objects, we create the train and test sets. For seeded segmentation, we perform data augmentation in the form of flips and rotations, increasing the number of training images accordingly. To generate the seeds, we add a single line or circle marker for each object, by first attempting to plant it at the object's center of mass or, if that fails, at a random location within the object. We resize the resulting images to one of two fixed sizes.
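The seed-planting rule described above can be sketched as follows (our own illustration; the function name and the exact fallback policy are assumptions):

```python
import random

# Plant a single point seed in a binary object mask: try the center of
# mass first; if it falls outside the object (e.g. a ring-shaped
# object), fall back to a random pixel inside the object.
def plant_seed(mask, rng=random.Random(0)):  # fixed rng for reproducibility
    pixels = [(r, c) for r, row in enumerate(mask)
              for c, v in enumerate(row) if v]
    cr = round(sum(r for r, _ in pixels) / len(pixels))
    cc = round(sum(c for _, c in pixels) / len(pixels))
    if mask[cr][cc]:
        return (cr, cc)
    return rng.choice(pixels)

solid = [[0, 0, 0, 0],
         [0, 1, 1, 0],
         [0, 1, 1, 0],
         [0, 0, 0, 0]]
ring = [[1, 1, 1],
        [1, 0, 1],
        [1, 1, 1]]

print(plant_seed(solid))  # rounded center of mass, inside the object
print(plant_seed(ring))   # center (1, 1) is empty -> random ring pixel
```

The fallback matters for non-convex objects, whose center of mass may lie outside the object itself.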
Note that Pascal VOC is not intended for image matting. There exist smaller datasets [49] that are less suitable for deep learning applications, as well as carefully controlled simulated datasets [9] that may not fully capture the expected behavior in the wild. We chose instead to demonstrate our approach on natural images, which are easy to obtain, and to use the minimal seeds described above, rather than the complex trimaps commonly used in image matting [49, 9].
Implementation Details We implement our technique in TensorFlow, on an Nvidia GTX TitanX GPU. We use the Adam [50] optimizer, with an initial learning rate that is exponentially decayed every few epochs, and train the network with mini-batches over the training data. In seeded segmentation, we train on the resized images and cascade dilated residual blocks up to a maximum dilation factor; in image matting, we train on the resized images and again use the same maximum dilation rate, with the energy parameters set per task. The training time for both seeded segmentation and image matting is on the order of hours.

Qualitative Results We show the results of seeded segmentation on resized images from the test set in Figure 2. One can see that our network approximates the analytic solution well. There are even occasions where the network solution outperforms the analytic one when compared with the ground truth; see, for example, several of the rows of Figure 2. In other cases, the network does not seem to disperse the seeds properly, a possible result of cascading limited-sized convolutions.
Since Pascal VOC is not intended for image matting, we show results in Figure 3 on images from the commonly used alphamatting.com [49] dataset, resized to a common size. As opposed to our training process, where we are given sparse seeds, this dataset includes heavier user assistance in the form of trimaps. We treat the trimap as a seed image of foreground and background, with the missing gray pixels to be completed by the algorithm. Despite the very different nature of the train and test sets, the network performs quite well, demonstrating a good generalization ability. In the bottom rows, one can see blurriness in the network's alpha mattes, which is a result of using a rather long chain of repeated convolutions. This could be remedied by an additional "refinement network", as suggested in [9], adapted to the deep energy loss. We leave this important direction for future work.
Table 1: Mean loss (analytic, network and random solutions), mean IOU (analytic and network solutions) and mean MSE (analytic and network solutions), over the train and test sets of the seeded segmentation and image matting tasks.
Proximity to Analytic Solution Since we train the network to minimize a given energy function, we would like to verify how close it comes to actually minimizing this function, by comparing the energy values obtained. In seeded segmentation, the analytic solution is the closed-form minimizer of the objective function in Equation (5). We plug the analytic and network solutions into this equation, and get two sets of values: the analytic loss and the network loss. We repeat this process for image matting, computing the analytic solution of the energy function given in Equation (7). For reference, we also add the average loss value of a solution provided by a freshly-initialized network with random weights. The results in Table 1 show that the network loss is close to the analytic one (compared to the initial "random" loss value), averaged over both train and test sets. The small gap between the network and analytic losses might be reduced through the selection of different architectures better suited to these specific tasks, or through the use of less arbitrary, more human-like seeds.
Table 2: Average evaluation time of the analytic and network solutions, and the resulting speedup, for images of varying sizes, on the seeded segmentation and image matting tasks.
Accuracy In seeded segmentation, since our final goal is providing a segmentation map, we compare the performance of the analytic and network solutions in terms of the mIOU (mean intersection over union) metric, computed against the ground-truth segmentations of the train and test sets. The actual segmentation is performed by an argmax operation over the class dimension (consisting of foreground and background) of the output probabilities. The results in Table 1 show that the network outperforms the analytic solution in terms of mIOU (the higher, the better). This result shows that although the network's only job is to mimic the analytic solution by approximating the random-walker energy function, it learns something about natural images which boosts the results. Note further the generalization ability of our approach: the segmentation performance on the test set, as measured by mIOU, is almost identical to that on the train set. This can be attributed to the fact that our training gains regularization through the use of the energy as a loss, rather than using a supervised formulation.

In image matting, we evaluate the average matting quality on the training images from alphamatting.com against the ground-truth alpha mattes. To quantify the performance of both solutions, we use the MSE metric proposed by [49]. As one can see from the results in Table 1, the analytic and network solutions are further apart than in seeded segmentation. This can be attributed to the fact that the network has been trained on an entirely different dataset.
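For reference, the mIOU metric for the binary case can be sketched as follows (our own illustration, on 1-D label lists):

```python
# Mean intersection-over-union for binary (foreground/background)
# segmentation: average the per-class IOU over both classes.
# (Classes absent from both prediction and ground truth would need
# special handling; they do not occur in this demo.)
def mean_iou(pred, gt):
    ious = []
    for cls in (0, 1):
        inter = sum(p == cls and g == cls for p, g in zip(pred, gt))
        union = sum(p == cls or g == cls for p, g in zip(pred, gt))
        ious.append(inter / union)
    return sum(ious) / len(ious)

gt   = [0, 0, 1, 1, 1, 0]
pred = [0, 1, 1, 1, 0, 0]

# Foreground: inter 2 / union 4 = 0.5; background: inter 2 / union 4 = 0.5
print(mean_iou(pred, gt))  # 0.5
```

A perfect prediction scores 1.0, and each misplaced pixel is penalized in both the class it wrongly joins and the class it wrongly leaves.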
Speed We now compare the speed of the analytic vs. the network solutions. The analytic solution is written in fully vectorized Numpy/Scipy code, and the network is implemented in TensorFlow. Table 2 shows the average evaluation time for images of varying sizes, using the analytic and network solutions. One can see a clear benefit in terms of speed in favor of our approach; as image dimensions grow, this benefit increases further, by large factors in both seeded segmentation and image matting.
5 Conclusion
We have introduced a new method of training CNNs by direct minimization of energy functions, which are commonly used in image processing and computer vision. We have shown application of this training technique to seeded segmentation and image matting, and demonstrated results which are comparable to, and sometimes exceed, the analytic solution. Future research will focus on fusing the deep energy loss with a standard supervised loss, to learn on datasets with few labelled examples.
References
 [1] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
 [2] D. Amodei et al. Deep speech 2: End-to-end speech recognition in English and Mandarin. In ICML, 2016.
 [3] Q. Chen, J. Xu, and V. Koltun. Fast image processing with fully-convolutional networks. In ICCV, 2017.
 [4] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In ACM Multimedia, 2014.
 [5] X. Zhang, J. Zhao, and Y. LeCun. Character-level convolutional networks for text classification. In NIPS, 2015.
 [6] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge. IJCV, 115(3):211–252, 2015.
 [7] A. Bearman, O. Russakovsky, V. Ferrari, and L. Fei-Fei. What’s the point: Semantic segmentation with point supervision. In ECCV, 2016.
 [8] M. Everingham, L. Van Gool, C. KI. Williams, J. Winn, and A. Zisserman. The pascal visual object classes (voc) challenge. IJCV, 88(2):303–338, 2010.
 [9] N. Xu, B. Price, S. Cohen, and T. Huang. Deep image matting. In CVPR, 2017.

 [10] W. Ren, S. Liu, H. Zhang, J. Pan, X. Cao, and M. Yang. Single image dehazing via multi-scale convolutional neural networks. In ECCV, 2016.
 [11] E. Richardson, M. Sela, and R. Kimmel. 3D face reconstruction by learning from synthetic data. In 3D Vision (3DV), 2016 Fourth International Conference on, 2016.
 [12] M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman. Synthetic data and artificial neural networks for natural scene text recognition. arXiv preprint arXiv:1406.2227, 2014.
 [13] G. Ros, L. Sellart, J. Materzynska, D. Vazquez, and A. M. Lopez. The synthia dataset: A large collection of synthetic images for semantic segmentation of urban scenes. In CVPR, 2016.
 [14] A. Dosovitskiy, P. Fischer, E. Ilg, P. Hausser, C. Hazirbas, V. Golkov, P. van der Smagt, D. Cremers, and T. Brox. Flownet: Learning optical flow with convolutional networks. In ICCV, 2015.
 [15] Y. Boykov and V. Kolmogorov. An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision. TPAMI, 26(9):1124–1137, 2004.
 [16] T. Meltzer, C. Yanover, and Y. Weiss. Globally optimal solutions for energy minimization in stereo vision using reweighted belief propagation. In ICCV, 2005.
 [17] M. Elad and A. Feuer. Restoration of a single super-resolution image from several blurred, noisy, and undersampled measured images. IEEE Transactions on Image Processing, 6(12):1646–1658, 1997.
 [18] K. He, J. Sun, and X. Tang. Single image haze removal using dark channel prior. TPAMI, 33(12):2341–2353, 2011.
 [19] A. Wedel, D. Cremers, T. Pock, and H. Bischof. Structure- and motion-adaptive regularization for high accuracy optic flow. In ICCV, 2009.
 [20] D. Ulyanov, A. Vedaldi, and V. Lempitsky. Deep image prior. arXiv preprint arXiv:1711.10925, 2017.
 [21] L. A. Gatys, A. S. Ecker, and M. Bethge. A neural algorithm of artistic style. arXiv preprint arXiv:1508.06576, 2015.
 [22] J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In ECCV, 2016.
 [23] D. Ulyanov, V. Lebedev, A. Vedaldi, and V. S. Lempitsky. Texture networks: Feed-forward synthesis of textures and stylized images. In ICML, 2016.
 [24] F. Liu, C. Shen, and G. Lin. Deep convolutional neural fields for depth estimation from a single image. In CVPR, 2015.
 [25] J. J. Tompson, A. Jain, Y. LeCun, and C. Bregler. Joint training of a convolutional network and a graphical model for human pose estimation. In NIPS, 2014.
 [26] C. Rother, V. Kolmogorov, and A. Blake. Grabcut: Interactive foreground extraction using iterated graph cuts. In ACM transactions on graphics (TOG), volume 23, pages 309–314, 2004.
 [27] Y. Y. Boykov and M. P. Jolly. Interactive graph cuts for optimal boundary & region segmentation of objects in N-D images. In ICCV, 2001.
 [28] L. Grady. Random walks for image segmentation. TPAMI, 28(11):1768–1783, 2006.
 [29] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
 [30] A. Levin, D. Lischinski, and Y. Weiss. A closed-form solution to natural image matting. TPAMI, 30(2):228–242, 2008.
 [31] Y. Chuang, B. Curless, D. H. Salesin, and R. Szeliski. A Bayesian approach to digital matting. In CVPR, 2001.
 [32] E. S. Gastal and M. M. Oliveira. Shared sampling for real-time alpha matting. In Computer Graphics Forum, 2010.
 [33] K. He, C. Rhemann, C. Rother, X. Tang, and J. Sun. A global sampling method for alpha matting. In CVPR, 2011.
 [34] L. Grady, T. Schiwietz, S. Aharon, and R. Westermann. Random walks for interactive alpha-matting. In Proceedings of VIIP, volume 2005, pages 423–429, 2005.
 [35] J. Wang and M. F. Cohen. Optimized color sampling for robust matting. In CVPR, 2007.
 [36] X. Shen, X. Tao, H. Gao, C. Zhou, and J. Jia. Deep automatic portrait matting. In ECCV, 2016.
 [37] D. Cho, Y. Tai, and I. Kweon. Natural image matting using deep convolutional neural networks. In ECCV, 2016.
 [38] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
 [39] L. A. Vese and S. J. Osher. Modeling textures with total variation minimization and oscillating patterns in image processing. Journal of Scientific Computing, 19(1–3):553–572, 2003.
 [40] M. Elad and M. Aharon. Image denoising via sparse and redundant representations over learned dictionaries. IEEE Transactions on Image Processing, 15(12):3736–3745, 2006.
 [41] D. Fleet and Y. Weiss. Optical flow estimation. In Handbook of mathematical models in computer vision, pages 237–257. 2006.
 [42] A. Levin, R. Fergus, F. Durand, and W. T. Freeman. Image and depth from a conventional camera with a coded aperture. ACM transactions on graphics (TOG), 26(3):70, 2007.
 [43] M. Rubinstein, D. Gutierrez, O. Sorkine, and A. Shamir. A comparative study of image retargeting. In ACM transactions on graphics (TOG), volume 29, page 160, 2010.
 [44] D. J. Jobson, Z. Rahman, and G. A. Woodell. A multi-scale retinex for bridging the gap between color images and the human observation of scenes. IEEE Transactions on Image Processing, 6(7):965–976, 1997.
 [45] C. Couprie, L. Najman, and H. Talbot. Seeded segmentation methods for medical image analysis. In Medical Image Processing, pages 27–57. Springer, 2011.
 [46] M. A. Kupinski and M. L. Giger. Automated seeded lesion segmentation on digital mammograms. IEEE Transactions on Medical Imaging, 17(4):510–517, 1998.
 [47] F. Yu and V. Koltun. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122, 2015.
 [48] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
 [49] C. Rhemann, C. Rother, J. Wang, M. Gelautz, P. Kohli, and P. Rott. A perceptually motivated online benchmark for image matting. In CVPR, 2009.
 [50] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.