1 Motivation and background
Regularized loss functions are widely used for weakly supervised training of neural networks
weston2012deep ; goodfellow2016deep . In particular, they are useful for weakly supervised CNN segmentation NCloss:CVPR18 ; Rloss:ECCV18 , where full supervision is often infeasible, particularly in biomedical applications. Such losses are motivated by regularization energies in shallow¹ segmentation, where multi-decade research efforts went into designing robust regularization models based on geometry MumfordShah89 ; Sapiro:97 ; BK:ICCV03 , physics Kass:88 ; BlakeZis:87 , or robust statistics Geman:84 . Such models should represent realistic shape priors compensating for significant image-data ambiguities, yet remain amenable to efficient solvers. Many robust regularization models commonly used in vision SzeliskiMRFCompare:08 ; kappes2015comparative are non-convex and require powerful optimizers to avoid many weak local minima; basic local optimizers typically fail to produce practically useful results with such models. Effective weakly supervised CNN methods for vision should incorporate priors compensating for image-data ambiguities and lack of supervision, just as shallow vision methods do. However, the use of regularization models as losses in deep learning is limited by the ability to optimize them via gradient descent, the backbone of current training methods.

¹ In this paper, “shallow” refers to methods unrelated to deep learning.
This paper uses weakly supervised CNN segmentation as a representative example to discuss optimization of a general class of regularized losses based on the pairwise Potts model, one of the most basic robust models in vision. We consider two common variants, the nearest-neighbor and the large-neighborhood Potts model, also known as the sparse grid CRF and dense CRF models in shallow segmentation.
1.1 Pairwise CRF regularization for shallow segmentation
The robust pairwise Potts model and its binary version (the Ising model) are used in many applications, such as stereo, reconstruction, and segmentation. One can define this model as a cost functional over integer-valued labelings $S = (S_p)$ of image pixels $p \in \Omega$ as follows:
(1)    $E(S) \;=\; \sum_{pq \in \mathcal{N}} w_{pq}\,[S_p \neq S_q]$
where $\mathcal{N}$ is a given neighborhood system, $w_{pq}$ is a discontinuity penalty between neighboring pixels $p$ and $q$, and $[\cdot]$ is the Iverson bracket. The nearest-neighbor version over a connected pixel grid, as well as its popular variational analogues, e.g. geodesic active contours Sapiro:97 , convex relaxations pock:CVPR09 ; chambolle2011first , or continuous max-flow Yuan:CVPR2010 , are particularly well-researched. It is common to use contrast-weighted discontinuity penalties BVZ:PAMI01 ; BJ:01 between the neighboring points, as emphasized by the condition below:
(2)    $w_{pq} \;\propto\; \exp\!\left(-\,\frac{\|I_p - I_q\|^2}{2\sigma^2}\right)$

where $I_p$ is the image intensity (or color) at pixel $p$ and $\sigma$ is a bandwidth parameter.
Nearest-neighbor Potts models minimize the contrast-weighted length of the segmentation boundary, preferring a shorter perimeter aligned with image edges, e.g. see Fig. 2 (b). The popularity of this model can be explained by its generality, robustness, well-established foundations in geometry, and a large number of efficient discrete or continuous solvers that guarantee a global optimum in binary problems BJ:01 or some quality bound in multi-label settings, e.g. α-expansion BVZ:PAMI01 .
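To make (1)–(2) concrete, here is a minimal pure-Python sketch of the contrast-weighted nearest-neighbor Potts energy on a 4-connected grid. This is our own illustration, not the paper's implementation; the function name and toy inputs are ours.

```python
import math

def potts_energy_grid(labels, image, sigma=1.0):
    """Contrast-weighted nearest-neighbor Potts energy E(S) = sum_pq w_pq [S_p != S_q].

    `labels` and `image` are 2D lists (rows of equal length). Weights follow the
    Gaussian contrast term in (2): w_pq = exp(-(I_p - I_q)^2 / (2 sigma^2)), so
    cuts across strong image edges are cheap, cuts in uniform regions expensive.
    """
    h, w = len(labels), len(labels[0])
    energy = 0.0
    for i in range(h):
        for j in range(w):
            # 4-connected neighborhood: count each pair once via right/down neighbors
            for di, dj in ((0, 1), (1, 0)):
                ni, nj = i + di, j + dj
                if ni < h and nj < w and labels[i][j] != labels[ni][nj]:
                    diff = image[i][j] - image[ni][nj]
                    energy += math.exp(-diff * diff / (2.0 * sigma * sigma))
    return energy
```

On a toy 2x2 image with a horizontal intensity edge, a segmentation boundary aligned with that edge incurs almost no penalty, while a cut through a uniform region pays full weight per severed pair.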
Dense CRF koltun:NIPS11 is a Potts model where pairwise interactions are active over significantly bigger neighborhoods, defined by a Gaussian kernel with a relatively large bandwidth $\sigma$ over pixel locations:
(3)    $\mathcal{N} = \Omega \times \Omega, \qquad w_{pq} \;=\; \exp\!\left(-\,\frac{\|p - q\|^2}{2\sigma^2}\right)$
Its use in shallow vision is limited, as it often produces noisy boundaries koltun:NIPS11 , see also Fig. 2 (c). Also, the global optimization methods mentioned above do not scale to dense neighborhoods. Yet, the dense CRF model is very popular in the context of CNNs, where it can be used as a trainable regularization layer zheng2015conditional . A larger bandwidth yields a smoother objective (1), see Fig. 1 (c), amenable to gradient descent or other local linearization methods like mean-field inference that are easy to parallelize. Note that existing efficient inference methods for dense CRF require bilateral filtering koltun:NIPS11 , which is restricted to Gaussian weights exactly as in (3). This is in contrast with global Potts solvers, e.g. α-expansion, that can use arbitrary weights, but become inefficient for dense neighborhoods.
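For reference, one synchronous mean-field update for a Potts model can be sketched as follows. This is our own brute-force illustration with hypothetical names (`mean_field_step`, toy unaries); efficient dense CRF inference replaces the explicit pairwise sums with bilateral filtering koltun:NIPS11 .

```python
import math

def mean_field_step(Q, w, unary):
    """One synchronous mean-field update for a Potts model.

    Each pixel's label distribution Q_p(k) is re-estimated proportionally to
    exp(-unary_p(k) - sum_q w_pq * (1 - Q_q(k))), i.e. the expected Potts
    penalty under the neighbors' current distributions.

    Q: list of per-pixel label distributions, w: dict {(p, q): weight},
    unary: per-pixel, per-label unary potentials.
    """
    n, K = len(Q), len(Q[0])
    new_Q = []
    for p in range(n):
        scores = []
        for k in range(K):
            pen = unary[p][k]
            for (a, b), wpq in w.items():
                if a == p:
                    pen += wpq * (1.0 - Q[b][k])  # expected disagreement with neighbor b
                elif b == p:
                    pen += wpq * (1.0 - Q[a][k])
            scores.append(math.exp(-pen))
        z = sum(scores)  # normalize to a distribution
        new_Q.append([s / z for s in scores])
    return new_Q
```

On a two-pixel toy problem where only one pixel has a discriminative unary term, a few iterations propagate that evidence to its neighbor through the attractive pairwise weight.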
Figure 1: (a) 1D image, (b) grid CRF BJ:01 , (c) dense CRF koltun:NIPS11 .
Figure 2: (a) image + seeds, (b) grid CRF BJ:01 , (c) dense CRF koltun:NIPS11 .
Noise in dense CRF inference results suggests weaker regularization properties. Indeed, it is easy to check that for increasingly larger neighborhoods the Potts model gets closer and closer to cardinality potentials. The bandwidth $\sigma$ in (3) defines a scale or resolution at which the dense CRF model sees the segmentation boundary. Weaker regularization in dense CRF may occasionally preserve thin structures that could be over-smoothed by fine-resolution boundary regularizers, e.g. the nearest-neighbor Potts model. However, this is the same phenomenon as the preservation of noise in Fig. 2.
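The claim that the large-neighborhood Potts model approaches cardinality potentials can be checked numerically. The brute-force sketch below (ours, for illustration only) evaluates the dense Gaussian-weighted Potts energy (3); as the bandwidth grows, every pairwise weight approaches 1, so the energy approaches the count of label-disagreeing pairs, i.e. the product of segment sizes in the binary case, regardless of boundary shape.

```python
import math

def dense_potts_energy(labels, sigma):
    """Dense (fully connected) Potts energy with Gaussian weights over pixel
    locations, as in (3): w_pq = exp(-||p - q||^2 / (2 sigma^2)), all pairs p < q."""
    h, w = len(labels), len(labels[0])
    pts = [(i, j, labels[i][j]) for i in range(h) for j in range(w)]
    energy = 0.0
    for a in range(len(pts)):
        for b in range(a + 1, len(pts)):
            (ia, ja, la), (ib, jb, lb) = pts[a], pts[b]
            if la != lb:
                d2 = (ia - ib) ** 2 + (ja - jb) ** 2
                energy += math.exp(-d2 / (2.0 * sigma * sigma))
    return energy
```

For a 2x2 checkerboard labeling with two pixels per class, a huge bandwidth drives the energy to 2 * 2 = 4 cut pairs, while a small bandwidth discounts distant pairs.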
For consistency and brevity, the rest of the paper refers to the nearest-neighbor Potts model as grid CRF, and to the large-neighborhood Potts model as dense CRF.
1.2 Summary of contributions
Any motivation for standard regularization models in shallow image segmentation, as in the previous section, directly translates into their motivation as regularized loss functions in weakly supervised CNN segmentation NCloss:CVPR18 ; Rloss:ECCV18 . The main issue is how to optimize these losses. Standard training techniques based on gradient descent may not be appropriate for many powerful regularization models, which may have many local minima. Our main contributions are as follows:
- As an alternative to gradient descent (GD), we propose a general ADM-based framework for minimizing regularized losses during network training that can directly employ known efficient solvers for the corresponding shallow regularizers.
- Compared to GD, our ADM approach with the α-expansion solver significantly improves optimization quality for the grid CRF (nearest-neighbor Potts) loss in weakly supervised CNN segmentation. While each iteration of ADM is slower than GD, the loss decreases at a significantly larger rate with ADM; in one step it can reach lower loss values than those at which GD converges.
- Training with the grid CRF loss achieves the state of the art in weakly supervised CNN segmentation. We compare the dense CRF and (sparse) grid CRF losses. Our results may inspire more research on loss functions and their optimization.
2 ADM for loss optimization
Given an image $I$ and its partial ground-truth labeling (mask) $Y$, we consider a CNN regularized loss of the following form:
(4)    $\min_{\theta}\;\; \ell(S, Y) \;+\; \lambda\, R(S)$
where $S \in [0,1]^{|\Omega| \times K}$ is the $K$-way softmax segmentation generated by the network, with $K$ the number of labels and $\theta$ the set of network parameters. $R(S)$ is a regularization term, e.g. the sparse Potts or dense CRF model, and $\ell(S, Y)$ is a partial ground-truth loss, for instance

$\ell(S, Y) \;=\; \sum_{p \in \Omega_L} H(Y_p, S_p)$

where $\Omega_L$ is the set of labeled pixels and $H(Y_p, S_p)$ is the cross entropy between the network-predicted segmentation $S_p$ (the row of matrix $S$ corresponding to pixel $p$) and the ground-truth labeling $Y_p$. $Y$ is the matrix whose rows are given by the $Y_p$'s.
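The partial ground-truth loss simply restricts the cross-entropy sum to the labeled pixels. A minimal sketch (our own, with hypothetical names) assuming one-hot ground-truth rows:

```python
import math

def partial_cross_entropy(S, Y, labeled):
    """Partial cross-entropy loss: sum of H(Y_p, S_p) over labeled pixels only.

    S: list of per-pixel K-way softmax rows, Y: list of one-hot ground-truth
    rows, labeled: set of pixel indices Omega_L. Unlabeled pixels contribute
    nothing, which is what makes the supervision "partial".
    """
    loss = 0.0
    for p in labeled:
        for k, y in enumerate(Y[p]):
            if y > 0:
                loss -= y * math.log(S[p][k])  # -log of predicted prob of true label
    return loss
```

Note that expanding the labeled set strictly increases the loss value whenever the added pixels are imperfectly predicted.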
We present a general alternating direction method (ADM) for optimizing regularized losses of the general form in (4), using the following decomposition of the problem:
(5)    $\min_{\theta, X}\;\; \ell(S, Y) \;+\; \lambda\, R(X) \;+\; \gamma\, D(X \,\|\, S) \qquad \text{s.t.}\;\; X_p = Y_p \;\;\forall p \in \Omega_L$
where $D(\cdot \| \cdot)$ denotes some divergence measure, e.g. the Kullback–Leibler divergence, and $\gamma$ is a Lagrange multiplier for the constraints $X = S$. In (5), we introduced latent discrete (binary) variables $X_p \in \{0,1\}^K$, which are unknown for unlabeled pixels and constrained to be equal to $Y_p$ for labeled pixels ($X$ is the binary matrix whose rows are given by the $X_p$'s). Therefore, instead of optimizing the regularization term with gradient descent, our approach splits regularized-loss problem (4) into two sub-problems. We replace the network softmax outputs $S$ in the regularization term by the latent discrete variables $X$ and ensure consistency between both variables (i.e., $X \approx S$) by minimizing the divergence $D$. This is conceptually similar to the general principles of ADM Boyd2011 : in its most basic form, ADM transforms an original problem $\min_u F(u) + G(u)$ into $\min_{u,v} F(u) + G(v) + \gamma\, D(u \,\|\, v)$ and alternates optimization over $u$ and $v$, handling $F$ and $G$ separately. Our ADM splitting accommodates the use of powerful and well-established discrete solvers for the regularization loss. As we will see in the experiments, the popular α-expansion solver BVZ:PAMI01 significantly improves optimization of grid CRF losses, yielding state-of-the-art training quality. Such efficient discrete solvers guarantee a global optimum in binary problems BJ:01 or some quality bound in multi-label settings BVZ:PAMI01 . Our discrete-continuous ADM method alternates two steps, each decreasing (5), until convergence. Given fixed discrete latent variables $X$ computed at the previous iteration, the first step learns the network parameters $\theta$ by minimizing the following loss via standard back-propagation and stochastic gradient descent (SGD):
(6)    $\min_{\theta}\;\; \ell(S, Y) \;+\; \gamma\, D(X \,\|\, S)$
The second step fixes the network output $S$ and finds the next latent binary variables $X$ by minimizing the following objective over $X$ via α-expansion:
(7)    $\min_{X}\;\; \lambda\, R(X) \;+\; \gamma\, D(X \,\|\, S) \qquad \text{s.t.}\;\; X_p = Y_p \;\;\forall p \in \Omega_L$
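The alternation of (6) and (7) can be sketched on a toy 1D problem. This is our own illustrative stand-in, not the paper's implementation: the "network" is just one logit per pixel, the divergence $D$ is cross entropy, $R$ is a 1D nearest-neighbor Potts term, and the discrete step (7) is solved by exhaustive search instead of α-expansion (which would be used on real image grids).

```python
import math
from itertools import product

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def adm_train(labeled, n=4, lam=0.5, gamma=1.0, lr=0.5, outer=20, inner=30):
    """Toy discrete-continuous ADM for a binary 1D segmentation.

    labeled: dict {pixel index: ground-truth label in {0, 1}} (Omega_L).
    Returns the hard labeling predicted by the per-pixel logits z.
    """
    z = [0.0] * n  # per-pixel logits of a stand-in "network"
    for _ in range(outer):
        s = [sigmoid(v) for v in z]  # current softmax output S
        # Step (7): discrete sub-problem over X by exhaustive search,
        # minimizing lam * Potts(X) + gamma * cross-entropy D(X || S)
        best, best_e = None, float("inf")
        for X in product((0, 1), repeat=n):
            if any(X[p] != y for p, y in labeled.items()):
                continue  # X must agree with ground truth on labeled pixels
            cuts = sum(X[p] != X[p + 1] for p in range(n - 1))
            d = sum(-math.log(s[p] if X[p] == 1 else 1.0 - s[p]) for p in range(n))
            e = lam * cuts + gamma * d
            if e < best_e:
                best, best_e = X, e
        # Step (6): gradient descent on partial CE + gamma * D(X || S);
        # for a sigmoid, d/dz of the cross entropy with target x is (s - x)
        for _ in range(inner):
            s = [sigmoid(v) for v in z]
            for p in range(n):
                g = gamma * (s[p] - best[p])
                if p in labeled:
                    g += s[p] - labeled[p]  # gradient of the partial CE term
                z[p] -= lr * g
    return [1 if sigmoid(v) > 0.5 else 0 for v in z]
```

With seeds 0 and 1 at the two ends, the latent labeling $X$ snaps to a one-cut segmentation in the discrete step, and the continuous step then pulls the network output toward it.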
3 Experimental results
We conduct experiments for weakly supervised CNN segmentation with scribbles as supervision. The focus is on regularized loss approaches NCloss:CVPR18 ; Rloss:ECCV18 , yet we also compare our results to a proposal-generation based method, e.g. ScribbleSup scribblesup . We test both grid CRF and dense CRF as regularized losses. Such regularized losses can be optimized by stochastic gradient descent (GD) or by the alternating direction method (ADM), as discussed in Sec. 2. We compare four variants, namely DenseCRF-GD, DenseCRF-ADM, GridCRF-GD, and GridCRF-ADM, for weakly supervised CNN segmentation.
Before comparing segmentations, in Sec. 3.1 we investigate whether ADM optimization achieves lower regularized losses than standard GD. Our plots of training losses (CRF energy) vs. training iterations show how fast ADM and GD converge. Our experiments confirm that a first-order approach like GD leads to a poor local minimum for the grid CRF; there is a clear advantage of ADM over GD for minimizing the grid CRF loss. In Sec. 3.2, rather than comparing optimizers, we focus on the modeling, i.e. grid CRF vs. dense CRF. With ADM as the optimizer, our grid CRF regularized loss gives segmentations very comparable to those of the dense CRF based approach. We also study these variants of the regularized-loss method in the more challenging setting of shorter scribbles scribblesup , or clicks in the extreme case.
Dataset and implementation details. Following recent work deeplab ; scribblesup ; kolesnikov2016seed ; NCloss:CVPR18 on CNN semantic segmentation, we report our results on the PASCAL VOC 2012 segmentation dataset. We train with scribbles from scribblesup on the augmented dataset of 10,582 images and test on the val set. Besides the standard mIOU (mean intersection over union), we also measure the regularization losses, i.e. grid CRF or dense CRF. Our implementation is based on DeepLab-v2 (https://bitbucket.org/aquariusjay/deeplab-public-ver2), and we show results on different networks including Deeplab-largeFOV, Deeplab-MSc-largeFOV, Deeplab-VGG16, and ResNet-101. The networks are trained in two phases. First, we train minimizing the (partial) cross-entropy loss w.r.t. scribbles. Then we train with the extra grid CRF or dense CRF regularization loss. For inference of the grid CRF and dense CRF, we use the public implementations of α-expansion (http://vision.csd.uwo.ca/code/gco-v3.0.zip) and mean-field koltun:NIPS11 , respectively. The CRF inference and losses are implemented and integrated as Caffe caffe layers. We run α-expansion for five iterations, which in most cases gives convergence. Our dense CRF loss does not include the spatial Gaussian kernel on locations, since ignoring this term does not change the mIOU measure koltun:NIPS11 . The bandwidth of the dense Gaussian kernel is validated to give the best mIOU. For the grid CRF, the kernel bandwidth selection follows the standard Boykov–Jolly scheme BJ:01 . In general, our ADM optimization for regularized losses is slower than GD due to grid/dense CRF inference. However, for inference algorithms that cannot be easily parallelized, e.g. graph cuts, we utilize simple multi-core parallelization over the images in a batch to accelerate training.
3.1 ADM vs gradient descent for CRF losses
In this section, we show that for the sparse grid CRF loss the ADM approach employing α-expansion BVZ:PAMI01 , a powerful discrete optimization method, outperforms common gradient descent methods for regularized losses NCloss:CVPR18 ; Rloss:ECCV18 in terms of finding a lower minimum of the regularization loss. Table 1 shows grid CRF losses on the training set (1,000 sampled training examples) and the validation set for different network architectures. Figure 3(a) shows the evolution of the sparse CRF loss over the number of training iterations; ADM requires fewer iterations to achieve the same CRF loss.
Table 1: Grid CRF losses (lower is better) achieved by GD and ADM on the training and validation sets.

| Network | Training (GD) | Training (ADM) | Validation (GD) | Validation (ADM) |
|---|---|---|---|---|
| Deeplab-LargeFOV | 2.518 | 2.408 | 2.509 | 2.334 |
| Deeplab-MSc-largeFOV | 2.509 | 2.401 | 2.494 | 2.326 |
| Deeplab-VGG16 | 2.374 | 2.098 | 2.421 | 2.138 |
| ResNet-101 | 2.661 | 2.488 | 2.605 | 2.419 |
In contrast to the grid CRF, Figure 3(b) shows that an ADM approach kolesnikov2016seed ; NCloss:CVPR18 using the dense CRF is worse than simple gradient descent NCloss:CVPR18 due to limitations of the mean-field optimizer. There is no practical benefit of ADM over gradient descent for the dense CRF: both mean-field and gradient descent are first-order approaches approximating the Gibbs distribution or the original energy.
Figure 3: Evolution of (a) the grid CRF loss and (b) the dense CRF loss over training iterations for GD and ADM.
We also visualize the gradients with respect to the input of the network's softmax layer in Figure 4. Despite different formulations of regularized losses and their optimization, the gradients w.r.t. the network output are the driving force of training. In most cases, the gradient-based method produces significant gradient values only in a vicinity of the current model's predicted boundary. If the actual object boundary is sufficiently distant, gradient methods fail to detect it due to the sparsity of the grid CRF model; see Figure 1 for an illustrative “toy” example. On the other hand, the ADM method is able to predict a good latent segmentation, providing gradients that lead to a good solution more effectively.
Figure 4: Gradients w.r.t. the softmax layer's input: (a) input, (b) prediction, (c) Dense ADM kolesnikov2016seed , (d) Dense GD NCloss:CVPR18 , (e) Grid ADM, (f) Grid GD.
Thus, in the context of sparse CRFs, the ADM approach coupled with α-expansion shows a drastic improvement in optimization quality.
3.2 Grid CRF vs Dense CRF Loss for Weakly Supervised CNN Segmentation
The main results of this paper are summarized in Tab. 2. The mIOU measures on the val set of PASCAL VOC 2012 are reported for various networks. To see more clearly the effects of grid/dense CRF losses on network training, we show results both with and without dense CRF post-processing, popularized by Chen et al. deeplab . The quality of weakly supervised segmentation is bounded by that with full supervision, and we are interested in this gap for different weakly supervised approaches. Here we compare variants of regularized losses optimized by gradient descent (GD) or ADM. The regularized loss comprises the partial cross entropy (pCE) w.r.t. scribbles plus other regularizers or clustering criteria, e.g. grid/dense CRF or normalized cut (NC) Shi2000 ; NCloss:CVPR18 . The focus of the comparison is grid CRF vs. dense CRF via gradient descent or ADM optimization.
As shown in Tab. 2, all regularized approaches work better than the non-regularized approach that only minimizes the empirical loss w.r.t. scribbles. GridCRF-GD performs the worst among the regularized loss methods. This is due to the fact that a first-order method like gradient descent leads to a poor local minimum for the grid CRF in the context of energy minimization. Better optimization via ADM for the grid CRF gives much better segmentation. Indeed, our GridCRF-ADM compares favorably to DenseCRF-GD and DenseCRF-ADM. This grid CRF based method gives good-quality segmentation approaching that of full supervision. Some qualitative segmentation examples are shown in Fig. 5.
The grid CRF has been overlooked in deep CNN segmentation, currently dominated by the dense CRF as post-processing or trainable layers in the fully supervised setting. We show that for weakly supervised CNN segmentation, the grid CRF as a regularized loss can give segmentations as good as those with the dense CRF. The key to minimizing the grid CRF as a loss is better optimization via ADM rather than gradient descent. Such competitive results for the grid CRF loss confirm that the grid CRF is better than the dense CRF in terms of regularization of segmentation, as discussed in Sec. 1.
It is not obvious whether the grid CRF as a loss is beneficial for CNN segmentation, and we show that straightforward gradient descent for the grid CRF does not work well. We are the first to systematically compare grid CRF and dense CRF losses for weakly supervised CNN segmentation and to thoroughly discuss their corresponding optimization via gradient descent or ADM. Our technical contribution on optimization helps to reveal the limitations and advantages of the grid CRF vs. dense CRF models.
Note that our formulation of ADM for regularized losses in Sec. 2 is general and allows any regularization for which there is a good inference/optimization algorithm. This connects the abundant literature on energy minimization for various geometric or high-order terms with loss minimization for weakly supervised CNN segmentation.
Table 2: mIOU (%) on the PASCAL VOC 2012 val set with full supervision and with weak (scribble) supervision optimized by gradient descent or ADM, with and without dense CRF post-processing.

| Network | Post-processing | Full | pCE NCloss:CVPR18 | NC NCloss:CVPR18 | Dense (GD) | Grid (GD) | Dense (ADM) | Grid (ADM) |
|---|---|---|---|---|---|---|---|---|
| Deeplab-largeFOV | No | 63.0 | 55.8 | 59.7 | 62.2 | 60.4 | 61.0 | 61.7 |
| Deeplab-MSc-largeFOV | No | 64.1 | 56.0 | 60.5 | 63.1 | 61.2 | 61.3 | 62.9 |
| Deeplab-MSc-largeFOV | Yes | 68.7 | 62.0 | 65.1 | 65.9 | 62.4 | 65.4 | 66.5 |
| Deeplab-VGG16 | No | 68.8 | 60.4 | 62.4 | 64.4 | 63.3 | 63.4 | 65.2 |
| Deeplab-VGG16 | Yes | 71.5 | 64.3 | 65.2 | 66.4 | 64.4 | 65.7 | 67.7 |
| ResNet-101 | No | 75.6 | 69.5 | 72.8 | 72.9 | 71.7 | 72.5 | 72.8 |
| ResNet-101 | Yes | 76.8 | 72.8 | 74.5 | 75.0 | 74.1 | 74.5 | 75.0 |
Figure 5: Qualitative results: (a) input, (b) Grid GD, (c) Grid ADM, (d) Dense GD, (e) Dense ADM, (f) ground truth.
Following the evaluation protocol in ScribbleSup scribblesup , we also test our regularized-loss approaches training with gradually shortened scribbles. In the extreme case, scribbles degenerate to clicks on semantic objects, and we are interested in how much weakly supervised segmentation degrades. As shown in Fig. 6, the performance of ScribbleSup scribblesup drops drastically with shorter scribbles, while our GridCRF-ADM approach gives competitive performance.
4 Conclusion
The top-performing weakly supervised CNN segmentation NCloss:CVPR18 ; Rloss:ECCV18 is based on the regularized-loss framework for deep learning weston2012deep ; goodfellow2016deep . While this framework allows any differentiable regularization, gradient descent is known to be a poor optimization method for many regularization terms in shallow image segmentation, e.g. the standard grid CRF. In this paper, we propose a general ADM-based optimization framework for minimizing regularized losses that can take advantage of existing efficient solvers for the corresponding shallow regularizers. In particular, our ADM approach with the α-expansion solver achieves significantly better optimization quality for the grid CRF compared to gradient descent. With such ADM optimization, training with the grid CRF loss achieves the state of the art in weakly supervised CNN segmentation. We systematically investigated the grid CRF and dense CRF losses from modeling and optimization perspectives. With the proposed ADM optimization strategy, we find that the largely overlooked grid CRF compares favorably to the popular dense CRF. Our general ADM framework allows further integration of other segmentation regularizers and their efficient solvers into weakly supervised CNN segmentation.
References
 [1] A. Blake and A. Zisserman. Visual Reconstruction. Cambridge, 1987.
 [2] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.
 [3] Y. Boykov and V. Kolmogorov. Computing geodesics and minimal surfaces via graph cuts. In International Conference on Computer Vision, volume I, pages 26–33, 2003.
 [4] Yuri Boykov and Marie-Pierre Jolly. Interactive graph cuts for optimal boundary & region segmentation of objects in N-D images. In ICCV, volume I, pages 105–112, July 2001.
 [5] Yuri Boykov, Olga Veksler, and Ramin Zabih. Fast approximate energy minimization via graph cuts. IEEE transactions on Pattern Analysis and Machine Intelligence, 23(11):1222–1239, November 2001.
 [6] Vicent Caselles, Ron Kimmel, and Guillermo Sapiro. Geodesic active contours. International Journal of Computer Vision, 22(1):61–79, 1997.
 [7] Antonin Chambolle and Thomas Pock. A first-order primal-dual algorithm for convex problems with applications to imaging. Journal of Mathematical Imaging and Vision, 40(1):120–145, 2011.
 [8] LiangChieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. arXiv:1606.00915, 2016.
 [9] S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE transactions on Pattern Analysis and Machine Intelligence, 6:721–741, 1984.
 [10] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning. MIT press, 2016.
 [11] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM international conference on Multimedia, pages 675–678. ACM, 2014.
 [12] Jörg H Kappes, Bjoern Andres, Fred A Hamprecht, Christoph Schnörr, Sebastian Nowozin, Dhruv Batra, Sungwoong Kim, Bernhard X Kausler, Thorben Kröger, Jan Lellmann, et al. A comparative study of modern inference techniques for structured discrete energy minimization problems. International Journal of Computer Vision, 115(2):155–184, 2015.
 [13] M. Kass, A. Witkin, and D. Terzolpoulos. Snakes: Active contour models. International Journal of Computer Vision, 1(4):321–331, 1988.
 [14] Alexander Kolesnikov and Christoph H Lampert. Seed, expand and constrain: Three principles for weaklysupervised image segmentation. In European Conference on Computer Vision, pages 695–711. Springer, 2016.
 [15] Philipp Krahenbuhl and Vladlen Koltun. Efficient inference in fully connected CRFs with Gaussian edge potentials. In NIPS, 2011.
 [16] Di Lin, Jifeng Dai, Jiaya Jia, Kaiming He, and Jian Sun. ScribbleSup: Scribble-supervised convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3159–3167, 2016.
 [17] D. Mumford and J. Shah. Optimal approximations by piecewise smooth functions and associated variational problems. Comm. Pure Appl. Math., 42:577–685, 1989.
 [18] Thomas Pock, Antonin Chambolle, Daniel Cremers, and Horst Bischof. A convex relaxation approach for computing minimal partitions. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
 [19] Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation. IEEE Trans. Pattern Anal. Mach. Intell., 22:888–905, 2000.
 [20] R. Szeliski, R. Zabih, D. Scharstein, O. Veksler, V. Kolmogorov, A. Agarwala, M. Tappen, and C. Rother. A comparative study of energy minimization methods for Markov random fields with smoothness-based priors. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(6):1068–1080, 2008.
 [21] Meng Tang, Abdelaziz Djelouah, Federico Perazzi, Yuri Boykov, and Christopher Schroers. Normalized Cut Loss for Weakly-supervised CNN Segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
 [22] Meng Tang, Federico Perazzi, Abdelaziz Djelouah, Ismail Ben Ayed, Christopher Schroers, and Yuri Boykov. On Regularized Losses for Weakly-supervised CNN Segmentation. In European Conference on Computer Vision (ECCV), 2018.
 [23] Jason Weston, Frédéric Ratle, Hossein Mobahi, and Ronan Collobert. Deep learning via semisupervised embedding. In Neural Networks: Tricks of the Trade, pages 639–655. Springer, 2012.
 [24] Jing Yuan, Egil Bae, and XueCheng Tai. A study on continuous maxflow and mincut approaches. In IEEE conference on Computer Vision and Pattern Recognition (CVPR), 2010.
 [25] Shuai Zheng, Sadeep Jayasumana, Bernardino Romera-Paredes, Vibhav Vineet, Zhizhong Su, Dalong Du, Chang Huang, and Philip H. S. Torr. Conditional random fields as recurrent neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 1529–1537, 2015.