Deep Energy: Using Energy Functions for Unsupervised Training of DNNs

05/31/2018 ∙ by Alona Golts, et al. ∙ Google ∙ Technion

The success of deep learning has been due in no small part to the availability of large annotated datasets. Thus, a major bottleneck in the current learning pipeline is the human annotation of data, which can be quite time consuming. For a given problem setting, we aim to circumvent this issue via the use of an externally specified energy function appropriate for that setting; we call this the "Deep Energy" approach. We show how to train a network on an entirely unlabelled dataset using such an energy function, and apply this general technique to learn CNNs for two specific tasks: seeded segmentation and image matting. Once the network parameters have been learned, we obtain a high-quality solution in a fast feed-forward style, without the need to repeatedly optimize the energy function for each image.




1 Introduction

Deep learning [1] has enjoyed remarkable success in a wide range of fields including speech recognition [2], image processing [3], computer vision [4] and natural language processing [5]. One of the key factors in the rise of deep learning has been the availability of large-scale annotated/labeled datasets such as Imagenet [6]. While for some tasks, such as classification, the collection of large-scale ground-truth annotations is tolerable, in other tasks it is extremely tedious and expensive. For instance, it has been shown recently [7] that pixelwise labelling for semantic segmentation on Pascal VOC 2012 [8] takes about 4 minutes per image.

Beyond standard human annotation, there exist two main data labelling approaches. The first, annotations using simulated data (see e.g. [9, 10, 11, 12, 13, 14]), requires some expertise in terms of the behavior of the signals under consideration; further, there is often a domain shift between real and simulated data. In the second approach, which applies to image-processing tasks such as denoising, deblurring, super-resolution and compression, one can create the inputs to the network as a simple degradation of the original outputs/labels. For example, this technique was used and extended in


We propose a different approach, which circumvents the annotation issue entirely, by relying on an externally specified energy function. This energy is of a particular form: it is a function of both the data and the label, and when optimized over the desired label (holding the data fixed) it gives a prediction for the label given the data. Where would such an energy function come from? Fortunately, the computer vision and image processing literature is replete with examples of such functions for solving e.g., depth from stereo [15, 16], super-resolution [17], single-image dehazing [18] and optical flow estimation [19]. These energies typically incorporate terms which ensure fidelity to the observed data, as well as some sort of signal prior. With this in mind, the main thrust of this paper is as follows: given such an energy function, we show how to formulate learning on an entirely unlabelled dataset. Thus, one can save immensely on the cost of annotation, by dispensing with it entirely. Alternatively, the proposed scheme can be fused with supervised learning. The outcome of our method can be used as initial near-perfect automatic annotations, later to be manually adjusted in a much shorter time in order to create a labeled dataset for supervised learning. Also, if a small set of labeled data is given, it could be fused in as well to boost the overall scheme.

In this context, a natural question might be: given the energy function, why bother learning at all – why not simply optimize the energy at run-time? There are two main answers to this question. The first is speed: once the network has been learned, the solution can be obtained by an almost instant forward-pass, which is typically orders of magnitude faster than explicit optimization. The second answer relates to regularization, from two points of view. Just as in [20], the mere choice of a specific CNN architecture to represent the prediction acts as a successful regularizer. Further, having the network fit its solution to a variety of images during training acts as an additional regularization.

The main contributions of this paper are as follows. First, we propose the Deep Energy approach for learning on unlabelled datasets (Section 3.1). Second, we show how to apply this formalism to two example problems, seeded segmentation (Section 3.2) and image matting (Section 3.3). Third, we learn CNNs for each of these problems on unlabelled datasets, and show performance which often exceeds the analytical solution to the corresponding explicit energy minimization (Section 4).

2 Related Work

Training CNNs via Energy Minimization   Our approach is inspired by recent work on image style transfer. Given “content” and “style” images, the objective of [21] is to create a new image of the same content, in the style of a famous painting. The key to this method is the use of an energy function, which includes two terms. The first is a data fidelity term that forces the new image to be close to the content image in terms of a perceptual loss. The second is a “prior” or “physical” term, which promotes closeness in terms of second-order statistics between the newly-created and style images. Instead of minimizing this energy function for each input image, the works reported in [22, 23] propose training a fully convolutional network to do so. During the training, as opposed to being given pairs of regular and stylized images and training the network in a standard supervised fashion, the authors proceed in an unsupervised way, via direct minimization of the energy function previously described. By the end of the training process, the network has learned to transform any input image into a stylized artistic image, by a fast and simple forward-pass operation.

Another line of work which is somewhat related to our approach is joint training of neural networks and graphical models [24, 25]. The work reported in [24] tackles the task of single image depth estimation, using a neural network that is trained to provide the unary and pairwise potentials for a continuous CRF (Conditional Random Field). As opposed to supervised training, which regresses a neural network over ground-truth depth maps, their training of the unary and pairwise sub-networks involves an explicit minimization of a log-likelihood expression, modeling the relations between neighboring pixels. At test time, the two sub-networks provide the potentials that are fed to a closed-form analytic solution of the objective function used during training. In [25], the authors propose a hybrid between CNNs and MRFs (Markov Random Fields) in order to solve the human pose estimation problem. Their solution consists of two networks, the first, a supervised feed-forward part-detector providing rough heat-map estimations, and the second, a high-level spatial-model to enforce global pose consistency, formulated via CRFs. While these methods concentrate on specific applications and energy functions, we provide a broad end-to-end formulation.

Seeded Segmentation   Seeded segmentation has enjoyed a considerable amount of attention with the introduction of the popular graph-cuts [26, 27] and random-walker [28] algorithms. With the emergence of deep learning and fully annotated datasets such as Pascal VOC 2012 [8] and Imagenet [6], the focus has shifted to fully automatic semantic segmentation [29]. A recent attempt to alleviate the burden of per-pixel image annotation has been proposed in [7]. They suggest weak supervision in the form of user-provided points for the task of semantic segmentation. Their approach however does not originate from an energy minimization perspective and the weak supervision is only applied during training. At test time, regular semantic segmentation is performed.

Image Matting   There have been numerous proposals for tackling the image matting problem. Since this problem is ill-posed, an additional regularization in the form of user interaction is required. Usually the user must input an additional “trimap”, a rough segmentation of the image containing foreground, background and unknown pixels. The unknown pixels lie in the border between the object and the background, and usually contain challenging features such as hair, fur, semi-transparent objects, etc. In order to successfully output an accurate image matte, some iterative, often non-linear, optimization is involved [30, 31, 32, 33, 34, 35]. Several deep learning-based techniques for image matting [9, 36, 37] have been proposed over the past two years. The work in [36] uses a CNN to learn a trimap, to be later fed as input to an analytic solver. In [37], several analytic solvers are used to create output alpha mattes for supervised training of a CNN. As opposed to these works, we circumvent the use of an analytic solver and directly train the network via an appropriate objective function. In [9] a large dataset for supervised image matting is created, using a composition of interesting matting objects over natural scene backgrounds. Using this dataset, end-to-end image matting is performed. This method provides impressive results on simulated image-matting benchmarks, however it may suffer from domain-shift when applied to real-life images.

3 The Deep Energy Approach

3.1 The General Approach

Suppose that given data $x$, our goal is to compute the corresponding label $y$; it would be instructive to think of the label as another signal, rather than a discrete label as in classification. We assume the existence of an energy function $E(x, y)$ for which

$$y^*(x) = \arg\min_{y} E(x, y). \tag{1}$$

This formulation is quite standard in signal processing, image processing, and computer vision; it is the essence of an inverse problem approach [38, 39, 40].

Our idea is to learn a network which will effectively solve problem (1). In particular, consider a network parameterized by parameters $\theta$, such that the function $y^*(x)$ is hypothesized to look like $y_p(x; \theta)$, where $p$ stands for “prediction”. Thus, learning how to solve the inverse problem is equivalent to learning the parameters $\theta$. Assuming that our unlabelled training data consists of $N$ examples of the datum $x$, $\{x_i\}_{i=1}^{N}$, in order to learn $\theta$ we propose to solve:

$$\theta^* = \arg\min_{\theta} \frac{1}{N} \sum_{i=1}^{N} E\big(x_i, y_p(x_i; \theta)\big). \tag{2}$$

We essentially learn the network so that, averaged over all samples, the energy function is minimized (with respect to the network parameters). This is the deep energy approach: we learn a prediction function without seeing any labelled data, via an externally specified energy function.
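
The learning problem described above can be illustrated on a toy problem. The sketch below is not the setup used in this paper: it assumes a hypothetical 1-D energy (squared fidelity plus a first-difference smoothness penalty), a linear predictor in place of a CNN, and plain gradient descent in place of Adam; all function names are illustrative. Because this toy energy is quadratic, the learned predictor can be compared directly with the closed-form minimizer.

```python
import numpy as np

def energy(x, y, lam):
    """Toy energy E(x, y): fidelity to x plus first-difference smoothness of y."""
    return np.sum((y - x) ** 2) + lam * np.sum(np.diff(y, axis=-1) ** 2)

def energy_grad_y(x, y, lam):
    """Gradient of the toy energy with respect to y (works on a batch of rows)."""
    d = np.diff(y, axis=-1)              # d[k] = y[k+1] - y[k]
    g = 2.0 * (y - x)
    g[..., :-1] -= 2.0 * lam * d
    g[..., 1:] += 2.0 * lam * d
    return g

def train_linear_predictor(X, lam=1.0, lr=0.05, steps=3000):
    """Fit theta so that y_p(x; theta) = theta @ x minimizes the mean energy."""
    n = X.shape[1]
    theta = np.zeros((n, n))
    for _ in range(steps):
        Y = X @ theta.T                  # predictions for every unlabelled example
        G = energy_grad_y(X, Y, lam)     # dE/dy at the predictions
        theta -= lr * (G.T @ X) / len(X) # chain rule through y = theta @ x
    return theta

def analytic_solver(n, lam):
    """Closed-form minimizer of the toy energy: y* = (I + lam * D^T D)^{-1} x."""
    D = np.diff(np.eye(n), axis=0)       # (n-1, n) first-difference operator
    return np.linalg.inv(np.eye(n) + lam * D.T @ D)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))            # unlabelled data only, no targets
theta = train_linear_predictor(X)
gap = np.max(np.abs(theta - analytic_solver(8, 1.0)))
mean_energy_learned = energy(X, X @ theta.T, 1.0) / len(X)
mean_energy_identity = energy(X, X, 1.0) / len(X)
print("max deviation from the analytic solver:", gap)
```

In this linear, quadratic setting the trained predictor simply recovers the analytic solver; the interest of the approach lies in cases where the predictor is a CNN and the forward pass is much cheaper than solving the optimization per input.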

The proposed scheme can serve any application that can be posed in terms of an informative energy function, for which there is a difficulty in collecting large-scale data for supervised learning. Examples of such tasks are single image dehazing [18], optical flow [41], single image depth estimation [42], image retargeting [43], Retinex [44] and more. We turn now to demonstrate two specific such applications of Deep Energy: seeded segmentation and image matting.

3.2 Application: Seeded Segmentation

Background   The first application we consider is seeded segmentation, in which the input consists of both an image and user-provided seeds, and the output is a segmentation of the image. These seeds, in the form of points or scribbles, provide the algorithm with an additional cue in computing the segmentation. Seeded segmentation has enjoyed a certain amount of popularity within the medical imaging community [45, 46], given the fact that it allows the user a degree of control over the final result. From our point of view, seeded segmentation is an interesting test-bed for our Deep Energy approach, given the fact that there are standard energy functions for performing this task.

In particular, we use the energy function of the well-known Random Walker algorithm of Grady [28], which we now briefly detail. The image is represented as a weighted undirected graph $G = (V, \mathcal{E})$, where the vertices $V$ are the pixels, and the edges $\mathcal{E}$ are between each pixel and its 4-neighborhood. The weight of edge $e_{ij}$ is given by $w_{ij} = \exp\big(-\beta (g_i - g_j)^2\big)$, where $g_i, g_j$ are the grayscale values of the pixels and $\beta$ is a global scaling parameter.

The idea behind the random walker is the following: the probability of a random walk starting from a given pixel and reaching each one of the seeds/classes in the image, is also the final probability the pixel receives for that class. Formally, let $y^c \in \mathbb{R}^n$ be the probability of each pixel in the image to belong to the label of class $c$, where $n$ is the number of pixels in the image. We denote $s^c \in \mathbb{R}^n$ as the “seed image” for the $c$-th class, whose elements are $0$ or $1$, indicating absence or presence of a seed respectively. $y$ and $s$ are their respective vectorized versions over $C$ classes. Then, one can solve for the random walk probabilities by minimizing the following energy:

$$E(x, y) = \sum_{c=1}^{C} \Big[ (y^c)^T L \, y^c + (y^c - s^c)^T Q \, (y^c - s^c) \Big], \tag{3}$$

where $Q$ is a matrix whose diagonal elements indicate the presence of a seed (of any class) at a given pixel; and $L$ is the Laplacian of the graph. The first term is a smoothness term, ensuring that the segmentation is spatially smooth, though discontinuities are penalized less where the underlying image is discontinuous. The second term encourages fidelity to the seeds.
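
As an illustration of the energy just described, the following sketch builds the 4-neighbor graph Laplacian densely for a tiny image and minimizes the quadratic energy in closed form by setting its gradient to zero. It is a minimal NumPy sketch under stated assumptions, not the authors' implementation: the function names, the value of `beta`, the unit seed weight on the diagonal of `Q`, and the toy two-region image are all hypothetical, and a dense solve is only practical for very small images.

```python
import numpy as np

def grid_laplacian(gray, beta):
    """Dense Laplacian of the 4-neighbor grid graph, weights exp(-beta * (g_i - g_j)^2)."""
    h, w = gray.shape
    idx = np.arange(h * w).reshape(h, w)
    g = gray.ravel()
    W = np.zeros((h * w, h * w))
    # Each undirected edge appears once: horizontal pairs, then vertical pairs.
    for a, b in ((idx[:, :-1], idx[:, 1:]), (idx[:-1, :], idx[1:, :])):
        a, b = a.ravel(), b.ravel()
        wgt = np.exp(-beta * (g[a] - g[b]) ** 2)
        W[a, b] = wgt
        W[b, a] = wgt
    return np.diag(W.sum(axis=1)) - W

def random_walker(gray, seeds, beta=10.0):
    """Minimize sum_c y_c^T L y_c + (y_c - s_c)^T Q (y_c - s_c) in closed form.

    gray:  (h, w) grayscale image.
    seeds: (h, w, C) binary seed images, one per class.
    """
    h, w, C = seeds.shape
    S = seeds.reshape(h * w, C).astype(float)
    Q = np.diag(S.sum(axis=1))            # 1 on the diagonal wherever any seed sits
    L = grid_laplacian(gray, beta)
    Y = np.linalg.solve(L + Q, Q @ S)     # zero gradient: (L + Q) y_c = Q s_c
    return Y.reshape(h, w, C)

# Toy image: dark left half, bright right half, one seed pixel per region.
gray = np.zeros((6, 6))
gray[:, 3:] = 1.0
seeds = np.zeros((6, 6, 2))
seeds[2, 1, 0] = 1
seeds[3, 4, 1] = 1
probs = random_walker(gray, seeds)
labels = probs.argmax(axis=-1)
print(labels)
```

Because the rows of the Laplacian sum to zero, the per-class solutions add up to one at every pixel whenever the graph is connected and at least one seed is present, so they can be read as probabilities.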

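Tensorization   One can directly use the energy expression in Equation (3) as a loss function during training. We now show how to convert this function to a form which is friendlier for the tensor-processing which is common in deep learning packages. We rewrite Equation (3) as

$$E(x, y) = \sum_{c=1}^{C} \Big[ \tfrac{1}{2} \sum_{i=1}^{n} \sum_{j \in \mathcal{N}(i)} w_{ij} \, (y_i^c - y_j^c)^2 + \sum_{i=1}^{n} Q_{ii} \, (y_i^c - s_i^c)^2 \Big], \tag{4}$$

using a common expansion of the Laplacian quadratic form, where $\mathcal{N}(i)$ denotes the neighbors of pixel $i$ (each edge is visited from both of its endpoints, hence the factor $\tfrac{1}{2}$). Now, concatenate $K$ identical copies of the output $y$ along the last dimension to create $\bar{Y} \in \mathbb{R}^{n \times C \times K}$, where $K$ is the number of neighbors ($K = 4$ in our case). Further, form $\bar{Y}_N \in \mathbb{R}^{n \times C \times K}$, as a concatenation of the “neighbor images” of the output $y$; in the case of 4-neighborhoods, the neighbor images are simply the image shifted left, right, up and down. Finally, the weights $w_{ij}$ can be represented as an $n \times K$ matrix; we then take $C$ copies of this matrix, and put them in the 3-tensor $W \in \mathbb{R}^{n \times C \times K}$. Then the energy function can be written as follows:

$$E = \sum_{b} \Big[ \tfrac{1}{2} \sum_{i,c,k} \Big( W^{(b)} \odot \big(\bar{Y}^{(b)} - \bar{Y}_N^{(b)}\big)^2 \Big)_{i,c,k} + \sum_{i,c} Q^{(b)}_{ii} \Big( \big(y^{(b)} - s^{(b)}\big)^2 \Big)_{i,c} \Big], \tag{5}$$

where we have summed over the batch dimension, $b$; $\odot$ denotes the elementwise product; and the powers are taken elementwise.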
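
The tensorized form can be checked numerically against an explicit sum over graph edges. The sketch below handles a single image (a batch would add one leading axis) and makes several assumptions that are not taken from the paper: the function names, edge-replicating padding at the image border (so a missing neighbor contributes zero difference), and a unit seed weight.

```python
import numpy as np

def neighbor_stack(img):
    """Stack the left/right/up/down shifted copies of img along a new last axis.

    img has shape (H, W) or (H, W, C). Borders replicate the edge pixel, so a
    missing neighbor equals the pixel itself and contributes zero difference.
    """
    H, W = img.shape[:2]
    pad = [(1, 1), (1, 1)] + [(0, 0)] * (img.ndim - 2)
    p = np.pad(img, pad, mode="edge")
    shifted = [p[1:H + 1, 0:W], p[1:H + 1, 2:W + 2],    # left, right
               p[0:H, 1:W + 1], p[2:H + 2, 1:W + 1]]    # up, down
    return np.stack(shifted, axis=-1)

def tensorized_energy(gray, y, s, beta):
    """Random-walker energy of one image, computed from shifted neighbor images.

    gray: (H, W) image, y: (H, W, C) probabilities, s: (H, W, C) binary seeds.
    """
    K = 4
    Wt = np.exp(-beta * (gray[..., None] - neighbor_stack(gray)) ** 2)  # (H, W, K)
    Y_rep = np.repeat(y[..., None], K, axis=-1)                         # (H, W, C, K)
    Y_nb = neighbor_stack(y)                                            # (H, W, C, K)
    # Every edge is seen from both endpoints, hence the factor 1/2.
    smooth = 0.5 * np.sum(Wt[:, :, None, :] * (Y_rep - Y_nb) ** 2)
    q = s.sum(axis=-1, keepdims=True)                                   # seed-presence mask
    return smooth + np.sum(q * (y - s) ** 2)

def edgewise_energy(gray, y, s, beta):
    """Reference: the same energy summed explicitly over horizontal and vertical edges."""
    wh = np.exp(-beta * (gray[:, 1:] - gray[:, :-1]) ** 2)[..., None]
    wv = np.exp(-beta * (gray[1:, :] - gray[:-1, :]) ** 2)[..., None]
    smooth = (np.sum(wh * (y[:, 1:] - y[:, :-1]) ** 2)
              + np.sum(wv * (y[1:, :] - y[:-1, :]) ** 2))
    q = s.sum(axis=-1, keepdims=True)
    return smooth + np.sum(q * (y - s) ** 2)

rng = np.random.default_rng(0)
gray = rng.random((5, 7))
logits = rng.normal(size=(5, 7, 2))
y = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)  # softmax output
s = np.zeros((5, 7, 2))
s[1, 1, 0] = 1
s[3, 5, 1] = 1
e_tensor = tensorized_energy(gray, y, s, beta=5.0)
e_edges = edgewise_energy(gray, y, s, beta=5.0)
print(e_tensor, e_edges)
```

Only shifts, elementwise products and sums are involved, which is what makes this form convenient as a loss in a deep learning framework.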

3.3 Application: Image Matting

Background   A second energy function which will illustrate the Deep Energy approach comes from the problem of image matting. In this task, one extracts an object from its background by determining the opacity and color of each pixel in the foreground. The input is an RGB image $I$, which is a composition of the foreground image, $F$, and the background image, $B$. Each pixel in the image can be described by the matting equation:

$$I_i = \alpha_i F_i + (1 - \alpha_i) B_i, \tag{6}$$

where $\alpha_i \in [0, 1]$ is the opacity of each pixel, also called the alpha matte. Note that recovering the alpha matte from $I$ is extremely ill-posed. One has to deduce seven quantities for each pixel (the R, G, and B values for $F_i$ and $B_i$, and the value of $\alpha_i$).

In [30] the authors propose a closed-form solution to the image matting problem, with the help of minimal user interaction, again in the form of seeds. They derive a cost function using local smoothness assumptions on the foreground and background images, and show that it is possible to entirely eliminate the dependence on these images, thus allowing a closed-form global solution of the alpha matte. In particular, one may minimize the following energy:

$$E(I, \alpha) = \alpha^T L \, \alpha + (\alpha - s)^T Q \, (\alpha - s), \tag{7}$$

where $Q$ is a diagonal matrix whose diagonal elements indicate the presence of a seed (either foreground or background) at a given pixel, and $s$ is the seed image, whose elements are $1$ or $0$, indicating foreground or background respectively. Note that if a pixel is not a seed, $s$ can take on any value, as it will be zeroed out by the multiplication by $Q$. Finally, the matrix $L$ is a special Laplacian-like matrix, specific to the matting problem; as derived in [30], it is given by

$$L_{ij} = \sum_{k \,:\, (i,j) \in w_k} \Big( \delta_{ij} - \frac{1}{|w_k|} \Big[ 1 + (I_i - \mu_k)^T \Big( \Sigma_k + \frac{\epsilon}{|w_k|} I_3 \Big)^{-1} (I_j - \mu_k) \Big] \Big), \tag{8}$$

where $w_k$ is a patch around pixel $k$ and $|w_k|$ is the number of pixels in it; $\mu_k$ and $\Sigma_k$ are the mean and covariance of the patch; $\delta_{ij}$ is the Kronecker delta; $\epsilon$ is a small regularization constant; and $I_3$ is the $3 \times 3$ identity matrix.
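
For concreteness, the matting Laplacian of [30] can be assembled densely for a tiny image as sketched below. This is a minimal NumPy sketch under stated assumptions rather than the code used in this paper: the function name, the 3×3 patch radius, the value of `eps`, and the restriction to patches that lie fully inside the image are all choices made for illustration.

```python
import numpy as np

def matting_laplacian(img, eps=1e-5, r=1):
    """Dense matting Laplacian of an (H, W, 3) image using (2r+1)x(2r+1) patches."""
    H, W, _ = img.shape
    n, size = H * W, (2 * r + 1) ** 2
    flat = img.reshape(n, 3)
    idx = np.arange(n).reshape(H, W)
    L = np.zeros((n, n))
    for y in range(r, H - r):
        for x in range(r, W - r):
            win = idx[y - r:y + r + 1, x - r:x + r + 1].ravel()  # pixels of patch w_k
            D = flat[win] - flat[win].mean(axis=0)               # I_i - mu_k
            cov = D.T @ D / size                                 # Sigma_k (3 x 3)
            M = np.linalg.inv(cov + (eps / size) * np.eye(3))
            # delta_ij - (1/|w_k|) * (1 + (I_i - mu_k)^T M (I_j - mu_k))
            L[np.ix_(win, win)] += np.eye(size) - (1.0 + D @ M @ D.T) / size
    return L

rng = np.random.default_rng(0)
img = rng.random((5, 5, 3))
L = matting_laplacian(img)
row_sums = L.sum(axis=1)
min_eig = np.linalg.eigvalsh(L).min()
print(np.abs(row_sums).max(), min_eig)
```

The resulting matrix is symmetric, positive semi-definite, and has rows that sum to zero, so a constant alpha matte incurs no smoothness penalty.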

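Tensorization   We again rewrite the energy function in Equation (7) in a tensor-friendly format. Rephrasing the first term in (7) in terms of weights, we have that

$$\alpha^T L \, \alpha = \sum_{k} \; \sum_{(i,j) \in w_k,\; i < j} \omega_{ij}^{k} \, (\alpha_i - \alpha_j)^2, \tag{9}$$

where we sum over all overlapping patches $w_k$ around pixels $k$ in the resulting alpha matte, as well as over all possible combinations of pixel pairs $(i, j)$ in a given patch, where the total number of combinations is $\binom{|w_k|}{2}$. The weights, $\omega_{ij}^{k}$, are derived from the Laplacian and are given as:

$$\omega_{ij}^{k} = \frac{1}{|w_k|} \Big[ 1 + (I_i - \mu_k)^T \Big( \Sigma_k + \frac{\epsilon}{|w_k|} I_3 \Big)^{-1} (I_j - \mu_k) \Big]. \tag{10}$$

We can then tensorize the matting term and add the tensorized data term to get the final loss function:

$$E = \sum_{b} \Big[ \sum_{k,p} \Big( W^{(b)} \odot \big(A_1^{(b)} - A_2^{(b)}\big)^2 \Big)_{k,p} + \sum_{i} Q^{(b)}_{ii} \big(\alpha^{(b)}_i - s^{(b)}_i\big)^2 \Big], \tag{11}$$

where $p$ indexes the pixel pairs in a patch and $W$ is the matrix of weights $\omega_{ij}^{k}$. $A_1, A_2$ are repetitions of the alpha matte; the first represents the alpha mattes in index $i$ and the second represents the alpha mattes in index $j$. The data term of the energy function is exactly the same as that in seeded segmentation in Equation (5), apart from the minor difference that there is no need for summation over the classes since the resulting alpha matte is already $1$-dimensional, and $Q$ and $s$ are simply built from the foreground and background seed images.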
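
The pairwise form of the matting term can be verified numerically on a small example. The sketch below gathers the alpha matte at the two indices of every pixel pair in every patch and compares the weighted sum of squared differences with the quadratic form accumulated patch by patch. It is an illustrative sketch: the helper names, the 3×3 patches restricted to the image interior, and the value of `eps` are assumptions, not details taken from the paper.

```python
import numpy as np
from itertools import combinations

def patch_affinity(P, eps):
    """(1/|w|) * (1 + (I_i - mu)^T (Sigma + eps/|w| I_3)^{-1} (I_j - mu)) for one patch."""
    size = len(P)
    D = P - P.mean(axis=0)
    M = np.linalg.inv(D.T @ D / size + (eps / size) * np.eye(3))
    return (1.0 + D @ M @ D.T) / size

def patches(H, W, r):
    """Flat pixel indices of every full (2r+1)x(2r+1) patch, shape (num_patches, |w|)."""
    idx = np.arange(H * W).reshape(H, W)
    return np.array([idx[y - r:y + r + 1, x - r:x + r + 1].ravel()
                     for y in range(r, H - r) for x in range(r, W - r)])

def pair_tensors(img, eps=1e-5, r=1):
    """Index tensors and weights, each of shape (num_patches, num_pairs)."""
    H, W, _ = img.shape
    flat = img.reshape(-1, 3)
    wins = patches(H, W, r)
    pi, pj = np.array(list(combinations(range(wins.shape[1]), 2))).T  # pairs i < j
    Wts = np.array([patch_affinity(flat[w], eps)[pi, pj] for w in wins])
    return wins[:, pi], wins[:, pj], Wts

def tensorized_matting_term(alpha, I_idx, J_idx, Wts):
    a = alpha.ravel()
    A1, A2 = a[I_idx], a[J_idx]          # the matte gathered at index i and at index j
    return np.sum(Wts * (A1 - A2) ** 2)

def quadratic_form_reference(img, alpha, eps=1e-5, r=1):
    """alpha^T L alpha accumulated patch by patch, with L_patch = identity - affinity."""
    H, W, _ = img.shape
    flat, a = img.reshape(-1, 3), alpha.ravel()
    total = 0.0
    for w in patches(H, W, r):
        Lk = np.eye(len(w)) - patch_affinity(flat[w], eps)
        total += a[w] @ Lk @ a[w]
    return total

rng = np.random.default_rng(0)
img = rng.random((6, 6, 3))
alpha = rng.random((6, 6))
I_idx, J_idx, Wts = pair_tensors(img)
e_pairs = tensorized_matting_term(alpha, I_idx, J_idx, Wts)
e_quad = quadratic_form_reference(img, alpha)
print(e_pairs, e_quad)
```

For 3×3 patches each patch contributes 36 pixel pairs, so the weight matrix has one row per patch and 36 columns.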

Figure 1: System architecture. At input, our fully-convolutional network receives concatenated pairs of images and seeds. Apart from the input and output layers, our network is a cascade of “dilated residual blocks” (denoted by , where is the maximum dilation factor), which gradually increases the receptive field. The network’s prediction, along with the input image and seeds, are fed to the deep energy loss, which essentially approximates the task-specific objective function.

3.4 Architecture

Our network structure, shown in Figure 1, is fully convolutional and based on the Context Aggregation Network (CAN) introduced in [47]. Aggregation is performed by dilated convolutions with gradually increasing rates, that increase the receptive field and spread out the sparse seed data. We incorporate Resnet-style [48] skip connections to allow for additional gradient flow through the network layers.

More specifically, the intermediate layers consist of a cascade of dilated residual blocks. Each such block contains consecutive regular conv layers, then a single dilated-conv layer with a dilation factor of where is the block number. We add skip connections between each block’s input and output as shown in Figure 1. Similar to Resnet [48], we connect the input and output by a simple addition, keeping the width of the output intact. Apart from the final layer, all convolution layers have filters with an output width of . All layers, apart from the output layer and each block’s last dilated conv layer, are followed by batch-norm and ReLU nonlinearity. The dilated-conv layers in each block are followed by batch-norm. The final conv layer is a simple linear transformation performed with convolution of width equal to the size of the desired output, . In seeded segmentation, there is an additional softmax layer that converts the output scores into probabilities.
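
To see how a cascade of dilated convolutions enlarges the receptive field and thereby spreads sparse seed information, one can compute the receptive field of a stack of stride-1 convolutions. The numbers used below are purely hypothetical, since the exact layer counts, filter sizes and dilation schedule are not recoverable from the text above: two regular 3×3 conv layers per block, followed by one 3×3 dilated conv with dilation 2^b in block b.

```python
def receptive_field(layers):
    """Receptive field (pixels per axis) of stacked stride-1 convolutions.

    layers: iterable of (kernel_size, dilation) pairs.
    """
    rf = 1
    for k, d in layers:
        rf += (k - 1) * d      # each layer widens the field by (k - 1) * dilation
    return rf

def dilated_residual_cascade(num_blocks, convs_per_block=2, kernel=3):
    """Layer list for a hypothetical cascade of dilated residual blocks."""
    layers = []
    for b in range(num_blocks):
        layers += [(kernel, 1)] * convs_per_block   # regular conv layers
        layers.append((kernel, 2 ** b))             # dilated conv closing the block
    return layers

for blocks in (1, 3, 6):
    print(blocks, receptive_field(dilated_residual_cascade(blocks)))
```

Skip connections do not change this count, because the additive path only passes the block input through unchanged. Under these assumed settings the receptive field grows roughly exponentially with the number of blocks, while the number of layers grows only linearly.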

4 Results

We now present the results of our deep energy loss on the seeded segmentation and image matting tasks. The results showcase the ability of deep energy learning to mimic the behavior of random-walk segmentation [28] and closed-form image matting [30].

Dataset   We use the annotated training images in the Pascal VOC dataset [8] to create training and test images, for both tasks discussed. It is critical to note that although these images are fully annotated with pixelwise segmentation masks, we never make use of this information; rather, we simply use the masks to generate very simple seeds, which are inputs to both seeded segmentation and image matting as previously described. We now describe this process of seed generation.

We perform binary segmentation on each PASCAL VOC image, in which we isolate a single object from the rest of the image. If there exist several objects in the original image, we create several copies of the same image with different seeds in each distinct object. By looking at only one object at a time and eliminating images with very small objects, we create train and test images. For seeded segmentation, we perform data augmentation in the form of flips and rotations by and , increasing the number of training images to . To generate the seeds, we add a single line or circle marker for each object by first attempting to plant it in the center of mass, or if that fails, in a random location within the object. We resize the resulting images to either or .
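
The seed-generation step can be sketched as follows. This is one possible reading of the procedure rather than the authors' code: the function name, the marker radius, treating "failure" as the center of mass falling outside the object, and clipping the marker to the object mask are all assumptions.

```python
import numpy as np

def plant_seed(mask, radius=2, rng=None):
    """Return a binary seed image with one circular marker inside a binary object mask."""
    rng = np.random.default_rng() if rng is None else rng
    rows, cols = np.nonzero(mask)
    r0, c0 = int(round(rows.mean())), int(round(cols.mean()))   # center of mass
    if not mask[r0, c0]:
        # The center of mass fell outside the object: pick a random object pixel.
        k = rng.integers(len(rows))
        r0, c0 = rows[k], cols[k]
    rr, cc = np.ogrid[:mask.shape[0], :mask.shape[1]]
    disk = (rr - r0) ** 2 + (cc - c0) ** 2 <= radius ** 2
    return (disk & mask.astype(bool)).astype(np.uint8)          # keep the marker inside the object

# A convex object: the marker lands on the center of mass.
square = np.zeros((9, 9), dtype=np.uint8)
square[2:7, 2:7] = 1
seed_square = plant_seed(square)

# A ring-shaped object: the center of mass lies in the hole, so the fallback is used.
ring = np.ones((9, 9), dtype=np.uint8)
ring[3:6, 3:6] = 0
seed_ring = plant_seed(ring, rng=np.random.default_rng(0))
```

In both cases the ground-truth mask is used only to place the marker; the mask itself never enters the training loss.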

Note that the Pascal VOC is not intended for image matting. There exist smaller datasets [49] that are less suitable for deep learning applications, or carefully controlled simulated datasets [9] that may not fully capture the expected behavior in the wild. We chose instead to demonstrate our approach on natural images, which are easy to obtain; and to use the minimal seeds described above rather than the complex trimaps commonly used in image matting [49, 9].

Implementation Details   We implement our technique in TensorFlow on a GTX Titan-X Nvidia GPU. We use the Adam [50] optimizer with an initial learning rate of with an exponential decay factor of every epochs. We use a batch size of and train the network for epochs over the training data. In seeded segmentation, we set , and . We train on images and cascade dilated residual blocks reaching a maximum dilation factor of . In image matting we set , train on images and again use a maximum dilation rate of . The training time for both seeded segmentation and image matting is hours.

Figure 2: Seeded segmentation results from left to right: original image, analytic and network probabilities, ground truth segmentation with overlaid seeds, analytic and network segmentation.
Figure 3: Qualitative results of image matting. From left to right: original image, input trimap, matting ground truth, analytic solution and network solution.

Qualitative Results   We show the results of seeded segmentation on resized images from the test set in Figure 2. One can see that our network approximates the analytic solution well. There are even occasions where the network solution outperforms the analytic one, when compared with the ground truth; for example, see rows of Figure 2. In other cases (see row ), the network does not seem to disperse the seeds properly, a possible result of cascading limited-sized convolutions.

Since Pascal VOC is not intended for image matting, we show results in Figure 3 on images from the commonly used [49] dataset, resized to . As opposed to our training process where we are given seeds, this dataset includes heavier user assistance in the form of trimaps. We treat the trimap as a seed image of foreground and background, with the missing gray pixels to be completed by the algorithm. Despite the very different nature of the train and test sets, the network performs quite well, demonstrating a good generalization ability. In the bottom rows, one can see blurriness in the network alpha mattes, which is a result of using a rather long chain of repeated convolutions. This can be remedied by an additional “refinement network” as suggested in [9], adapted to the deep energy loss. We leave this important direction for future work.

                      mean-loss                       mean-IOU             mean-MSE
                      analytic  network  random       analytic  network    analytic  network
segmentation train
matting train
Table 1: Quantitative results of our approach

Proximity to Analytic Solution   Since we train the network to minimize a given energy function, we would like to verify how close it comes to actually minimizing this function, by comparing the energy values obtained. In seeded segmentation, the analytic solution is the closed-form solution of the objective function in Equation (5). We plug in the analytic and network solutions to this equation, and get two sets of values, the analytic loss and the network loss. We repeat this process for image matting and compute the analytic solution of the energy function, given in Equation (7). For reference, we also add the average loss value for a solution provided via a freshly-initialized network with random weights. The results in Table 1 show that the network loss is close to the analytic one (compared to initial “random” loss value), averaged over both train and test sets. The small gap between the network and analytic losses might be reduced through selection of different architectures, better suited for these specific tasks; or through the use of less arbitrary, more human-like seeds.

analytic network speedup analytic network speedup analytic network speedup
Table 2: Evaluation speedup using our approach

Accuracy   In seeded segmentation, since our final goal is providing a segmentation map, we compare the performance of the analytic and network solutions in terms of the mIOU (mean intersection over union) metric, computed against the ground-truth segmentations of the train and test sets. The actual segmentation is performed by an argmax operation over the class dimension (consisting of foreground-background) of the output probabilities. The results in Table 1 show that the network outperforms the analytic solution in terms of mIOU (the higher the better). This result shows that although the network’s only job is to mimic the analytic solution by approximating the random-walker energy function, it learns something about natural images which boosts the results. Note further the generalization ability of our approach: the segmentation performance on the test set as measured by mIOU is almost identical to that on the train set. This can be attributed to the fact that our training gains regularization through the use of the energy as a loss, rather than using a supervised formulation. In image matting, we evaluate the average matting quality on the training images from versus the ground truth alpha mattes. To quantify the performance of both solutions, we use the MSE metric proposed by [49]. As one can see from the results in Table 1, the analytic and network solutions are further apart than in seeded segmentation. This can be attributed to the fact that the network has been trained on an entirely different dataset.
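
The evaluation just described (argmax over the class dimension followed by intersection over union) can be sketched for a single image as below. The function name and the toy arrays are illustrative, and averaging the IOU over the classes present in one image is an assumption; the exact averaging protocol used for Table 1 is not specified in the text.

```python
import numpy as np

def mean_iou(probs, gt, num_classes):
    """Mean intersection over union between argmax(probs) and a ground-truth label map."""
    pred = probs.argmax(axis=-1)              # hard segmentation from class probabilities
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:                          # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))

# Toy 2x2 example with foreground (1) and background (0).
p_fg = np.array([[0.2, 0.9],
                 [0.7, 0.6]])
probs = np.stack([1.0 - p_fg, p_fg], axis=-1)  # (H, W, 2) class probabilities
gt = np.array([[0, 1],
               [0, 1]])
score = mean_iou(probs, gt, num_classes=2)
print(score)
```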

Speed   We now compare the speed of the analytic vs. network solutions. The analytic solution is written in fully vectorized Numpy/Scipy code and the network is implemented in TensorFlow. Table 2 shows the average evaluation time of images of varying sizes using the analytic and network solutions. One can see a clear benefit in terms of speed in favor of our approach. As image dimensions grow, this increases, up to a factor of in seeded segmentation and in image matting.

5 Conclusion

We have introduced a new method of training CNNs by direct minimization of energy functions, which are commonly used in image processing and computer vision. We have shown application of this training technique to seeded segmentation and image matting, and demonstrated results which are comparable to, and sometimes exceed, the analytic solution. Future research will focus on fusing the deep energy loss with a standard supervised loss, to learn on datasets with few labelled examples.


References

  • [1] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436, 2015.
  • [2] D. Amodei et al. Deep speech 2: End-to-end speech recognition in english and mandarin. In ICML, 2016.
  • [3] Q. Chen, J. Xu, and V. Koltun. Fast image processing with fully-convolutional networks. In ICCV, 2017.
  • [4] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In ACM Multimedia, 2014.
  • [5] X. Zhang, J. Zhao, and Y. LeCun. Character-level convolutional networks for text classification. In NIPS, 2015.
  • [6] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge. IJCV, 115(3):211–252, 2015.
  • [7] A. Bearman, O. Russakovsky, V. Ferrari, and L. Fei-Fei. What’s the point: Semantic segmentation with point supervision. In ECCV, 2016.
  • [8] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The pascal visual object classes (voc) challenge. IJCV, 88(2):303–338, 2010.
  • [9] N. Xu, B. Price, S. Cohen, and T. Huang. Deep image matting. In CVPR, 2017.
  • [10] W. Ren, S. Liu, H. Zhang, J. Pan, X. Cao, and M. Yang. Single image dehazing via multi-scale convolutional neural networks. In ECCV, 2016.
  • [11] E. Richardson, M. Sela, and R. Kimmel. 3d face reconstruction by learning from synthetic data. In 3D Vision (3DV), 2016 Fourth International Conference on, 2016.
  • [12] M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman. Synthetic data and artificial neural networks for natural scene text recognition. arXiv preprint arXiv:1406.2227, 2014.
  • [13] G. Ros, L. Sellart, J. Materzynska, D. Vazquez, and A. M. Lopez. The synthia dataset: A large collection of synthetic images for semantic segmentation of urban scenes. In CVPR, 2016.
  • [14] A. Dosovitskiy, P. Fischer, E. Ilg, P. Hausser, C. Hazirbas, V. Golkov, P. van der Smagt, D. Cremers, and T. Brox. Flownet: Learning optical flow with convolutional networks. In ICCV, 2015.
  • [15] Y. Boykov and V. Kolmogorov. An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision. TPAMI, 26(9):1124–1137, 2004.
  • [16] T. Meltzer, C. Yanover, and Y. Weiss. Globally optimal solutions for energy minimization in stereo vision using reweighted belief propagation. In ICCV, 2005.
  • [17] M. Elad and A. Feuer. Restoration of a single superresolution image from several blurred, noisy, and undersampled measured images. IEEE transactions on image processing, 6(12):1646–1658, 1997.
  • [18] K. He, J. Sun, and X. Tang. Single image haze removal using dark channel prior. TPAMI, 33(12):2341–2353, 2011.
  • [19] A. Wedel, D. Cremers, T. Pock, and H. Bischof. Structure-and motion-adaptive regularization for high accuracy optic flow. In ICCV, 2009.
  • [20] D. Ulyanov, A. Vedaldi, and V. Lempitsky. Deep image prior. arXiv preprint arXiv:1711.10925, 2017.
  • [21] L. A. Gatys, A. S. Ecker, and M. Bethge. A neural algorithm of artistic style. arXiv preprint arXiv:1508.06576, 2015.
  • [22] J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In ECCV, 2016.
  • [23] D. Ulyanov, V. Lebedev, A. Vedaldi, and V. S. Lempitsky. Texture networks: Feed-forward synthesis of textures and stylized images. In ICML, 2016.
  • [24] F. Liu, C. Shen, and G. Lin. Deep convolutional neural fields for depth estimation from a single image. In CVPR, 2015.
  • [25] J. J. Tompson, A. Jain, Y. LeCun, and C. Bregler. Joint training of a convolutional network and a graphical model for human pose estimation. In NIPS, 2014.
  • [26] C. Rother, V. Kolmogorov, and A. Blake. Grabcut: Interactive foreground extraction using iterated graph cuts. In ACM transactions on graphics (TOG), volume 23, pages 309–314, 2004.
  • [27] Y. Y. Boykov and M. P. Jolly. Interactive graph cuts for optimal boundary & region segmentation of objects in nd images. In ICCV, 2001.
  • [28] L. Grady. Random walks for image segmentation. TPAMI, 28(11):1768–1783, 2006.
  • [29] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
  • [30] A. Levin, D. Lischinski, and Y. Weiss. A closed-form solution to natural image matting. TPAMI, 30(2):228–242, 2008.
  • [31] Y. Chuang, B. Curless, D. H. Salesin, and R. Szeliski. A bayesian approach to digital matting. In CVPR, 2001.
  • [32] E. S. Gastal and M. M. Oliveira. Shared sampling for real-time alpha matting. In Computer Graphics Forum, 2010.
  • [33] K. He, C. Rhemann, C. Rother, X. Tang, and J. Sun. A global sampling method for alpha matting. In CVPR, 2011.
  • [34] L. Grady, T. Schiwietz, S. Aharon, and R. Westermann. Random walks for interactive alpha-matting. In Proceedings of VIIP, volume 2005, pages 423–429, 2005.
  • [35] J. Wang and M. F. Cohen. Optimized color sampling for robust matting. In CVPR, 2007.
  • [36] X. Shen, X. Tao, H. Gao, C. Zhou, and J. Jia. Deep automatic portrait matting. In ECCV, 2016.
  • [37] D. Cho, Y. Tai, and I. Kweon. Natural image matting using deep convolutional neural networks. In ECCV, 2016.
  • [38] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM journal on imaging sciences, 2(1):183–202, 2009.
  • [39] L. A. Vese and S. J. Osher. Modeling textures with total variation minimization and oscillating patterns in image processing. Journal of scientific computing, 19(1-3):553–572, 2003.
  • [40] M. Elad and M. Aharon. Image denoising via sparse and redundant representations over learned dictionaries. IEEE Transactions on Image processing, 15(12):3736–3745, 2006.
  • [41] D. Fleet and Y. Weiss. Optical flow estimation. In Handbook of mathematical models in computer vision, pages 237–257. 2006.
  • [42] A. Levin, R. Fergus, F. Durand, and W. T. Freeman. Image and depth from a conventional camera with a coded aperture. ACM transactions on graphics (TOG), 26(3):70, 2007.
  • [43] M. Rubinstein, D. Gutierrez, O. Sorkine, and A. Shamir. A comparative study of image retargeting. In ACM transactions on graphics (TOG), volume 29, page 160, 2010.
  • [44] D. J. Jobson, Z. Rahman, and G. A. Woodell. A multiscale retinex for bridging the gap between color images and the human observation of scenes. IEEE Transactions on Image processing, 6(7):965–976, 1997.
  • [45] C. Couprie, L. Najman, and H. Talbot. Seeded segmentation methods for medical image analysis. In Medical Image Processing, pages 27–57. Springer, 2011.
  • [46] M. A. Kupinski and M. L. Giger. Automated seeded lesion segmentation on digital mammograms. IEEE Transactions on medical imaging, 17(4):510–517, 1998.
  • [47] F. Yu and V. Koltun. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122, 2015.
  • [48] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
  • [49] C. Rhemann, C. Rother, J. Wang, M. Gelautz, P. Kohli, and P. Rott. A perceptually motivated online benchmark for image matting. In CVPR, 2009.
  • [50] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.