Deep convolutional neural networks (CNNs) are dominating most visual recognition problems and applications, including semantic segmentation [16], action recognition [25] and object detection [6]. When full supervision is available, CNNs can achieve outstanding performances; however, this type of supervision may not be available in a wide range of applications. In semantic segmentation, full supervision involves annotating all the pixels in each training image. The problem is further amplified when such annotations require expert knowledge or involve volumetric data, as is the case in medical imaging [15]. Therefore, the supervision of semantic segmentation with partial or weak labels, for example, scribbles [13, 27, 28], image tags [22, 19], bounding boxes [23] or points [1], has received significant research efforts in recent years.
Imposing prior knowledge on the network's prediction via some unsupervised loss is a well-established technique in semi-supervised learning [30, 7]. Such a prior acts as a regularizer that leverages unlabeled data with domain-specific knowledge. For instance, in semantic segmentation, several recent works showed that adding loss terms such as dense conditional random fields (CRFs) [28], graph clustering [27] or priors on the sizes of the target regions [12] can achieve outstanding performances with only fractions of full-supervision labels. However, imposing hard inequality or equality constraints on the output of deep CNNs is still in a nascent stage, and only a few recent works have focused on the subject [22, 17, 24, 12].
1.1 Problem formulation
We consider a general class of semi- or weakly-supervised semantic segmentation problems, where global inequality constraints are enforced on the network's output. In what follows, we present the formulation for 2D images; the same formulation applies directly to 3D images. Consider a training image I, with Ω its spatial (pixel) domain and Ω_L ⊆ Ω a set of labeled pixels, which corresponds to a fraction of the pixels in image domain Ω. For K classes, let y_p ∈ {1, …, K} denote the ground-truth label of a labeled pixel p ∈ Ω_L, and S ∈ [0, 1]^{|Ω| × K} a standard K-way softmax probability output of the network, with θ the parameters of the network. In matrix S, each row s_p = (s_{p,1}, …, s_{p,K}) corresponds to the predictions for a pixel p in Ω, which can be either unlabeled or labeled. We focus on problems of the following general form, in which we optimize a partial (semi-supervised) loss subject to a set of inequality constraints on the network output:

min_θ E(θ)   subject to   f_i(S) ≤ 0,  i = 1, …, N   (1)
where E(θ) is some standard loss for the set of labeled pixels Ω_L, e.g., the cross entropy E(θ) = −∑_{p ∈ Ω_L} log s_{p, y_p} (we give the cross entropy as an example, but our framework is not restricted to a specific form of loss for the set of labeled points). Inequality constraints of the general form in (1) can embed very useful prior knowledge on the network predictions for unlabeled pixels. Assume, for instance, that we have prior knowledge about the size of the target region (i.e., class) k. Such knowledge can be in the form of lower or upper bounds on size, which is common in medical image segmentation problems [12, 18, 8]. In this case, one can impose constraints of the form ∑_{p ∈ Ω} s_{p,k} ≤ a_k, with a_k denoting an upper bound on the size of region k. The same type of constraints can impose image-tag priors, a form of weak supervision enforcing whether a target region is present or absent in a given training image, as in multiple instance learning (MIL) scenarios [22, 12]. For instance, a constraint of the form ∑_{p ∈ Ω} s_{p,k} ≥ 1 forces class k to be present in a given training image.
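As a concrete sketch, the size and image-tag priors above can be written as differentiable functions of the softmax output. The code below is illustrative only: `probs` is a toy list of per-pixel class probabilities and the function names are ours, not the paper's.

```python
def soft_size(probs, k):
    """Differentiable 'size' of class k: the sum of softmax probabilities over pixels."""
    return sum(p[k] for p in probs)

def size_upper_constraint(probs, k, a_k):
    """f(S) = Size_k - a_k <= 0 imposes an upper bound a_k on the size of region k."""
    return soft_size(probs, k) - a_k

def presence_constraint(probs, k):
    """f(S) = 1 - Size_k <= 0 forces class k to be present (MIL-style image-tag prior)."""
    return 1.0 - soft_size(probs, k)

# Toy softmax output for 3 pixels and 2 classes (each row sums to 1).
probs = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.8]]
```

Both constraint functions return a non-positive value exactly when the corresponding inequality in (1) is satisfied.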
1.2 Challenges of constrained CNN optimization
Even when the constraints are convex with respect to the network output, problem (1) is very challenging for deep CNNs, which typically have millions of trainable parameters in the case of semantic segmentation. In optimization, a standard way to handle constraints is to solve the Lagrangian primal and dual problems in an alternating scheme [2]. For (1), this corresponds to alternating the optimization of the CNN for the primal, with stochastic optimization such as SGD, and projected gradient-ascent iterates for the dual. However, despite the clear benefits of imposing global constraints on CNNs, such standard Lagrangian-dual optimization is mostly avoided in modern deep networks. As discussed recently in [22, 17, 24], this might be explained by two main reasons: (1) computational complexity and (2) stability/convergence issues caused by alternating between stochastic optimization and dual updates.
As pointed out in [22, 17, 12], imposing hard constraints on the outputs of deep CNNs is challenging. In standard Lagrangian-dual optimization methods, an unconstrained optimization problem needs to be solved after each iterative dual step. This is not feasible for deep CNNs, however, as it would require re-training the network at each step. To avoid this problem, Pathak et al. [22] introduce a latent distribution, and minimize a KL divergence so that the CNN output matches this distribution as closely as possible. Since the network's output is not directly coupled with the constraints, its parameters can be optimized using standard techniques like SGD. While this strategy enabled adding inequality constraints in weakly supervised segmentation, it is limited to linear constraints. Moreover, the work in [17] imposes hard equality constraints on 3D human pose estimation. To alleviate the computational complexity, a Krylov sub-space approach is used to limit the solver to a randomly selected subset of constraints within each iteration. Therefore, constraints that are satisfied at one iteration may not be satisfied at the next, which might explain the negative results in that paper. In general, updating the network parameters and dual variables in an alternating fashion leads to a higher computational complexity than solving a loss function directly.
The second difficulty in Lagrangian optimization is the interplay between stochastic optimization (e.g., SGD) for the primal and the iterates/projections for the dual. Basic gradient methods have well-known issues with deep networks, e.g., they are sensitive to the learning rate and prone to weak local minima. Therefore, the dual part in Lagrangian optimization might obstruct the practical and theoretical benefits of stochastic optimization (e.g., speed and strong generalization performance), which are widely established for unconstrained deep network losses [9]. More importantly, solving the primal and dual separately may lead to instability during training or slow convergence, as shown recently in [17].
1.3 Penalty approaches
In the context of deep networks, "hard" inequality or equality constraints are typically handled in a "soft" manner by augmenting the loss with a penalty function [10, 11, 12]. The penalty-based approach is a simple alternative to Lagrangian optimization, and is well known in the general context of constrained optimization; see, for instance, [2]. In general, such penalty-based methods approximate a constrained minimization problem with an unconstrained one by adding a term (penalty) P(f_i(S)), which increases when constraint f_i(S) ≤ 0 is violated. By definition, a penalty P is a non-negative, continuous and differentiable function, which verifies P(f_i(S)) = 0 if and only if constraint f_i(S) ≤ 0 is satisfied. In semantic segmentation and, more generally, in deep learning, it is common to use a quadratic penalty for imposing an inequality constraint: P(f_i(S)) = [f_i(S)]_+^2, where [x]_+ = max(0, x) denotes the rectifier function. Fig. 1 depicts an illustration of different choices of penalty functions. Penalties are convenient for deep networks because they remove the requirement for explicit Lagrangian-dual optimization. The inequality constraints are fully handled within stochastic optimization, as in standard unconstrained losses, avoiding gradient-ascent iterates/projections over the dual variables and reducing the computational load of training. However, this simplicity of penalty methods comes at a price. In fact, it is well known that penalty methods do not guarantee constraint satisfaction and require careful and ad hoc tuning of the relative importance (or weight) of each penalty term in the overall function being minimized. More importantly, in the case of several competing constraints, penalties do not act as barriers at the boundary of the feasible set (i.e., a satisfied constraint yields a null penalty and a null gradient). As a result, a subset of constraints that are satisfied at one iteration may not be satisfied at the next. Lagrangian optimization can deal with these difficulties, and has several well-known theoretical and practical advantages over penalty-based methods [4, 5]: it finds automatically the optimal weights of the constraints and guarantees constraint satisfaction when feasible solutions exist. Unfortunately, as pointed out recently in [17, 12], these advantages of Lagrangian optimization do not materialize in practice in the context of deep CNNs. Apart from the computational-feasibility aspects, which the recent works in [17, 22] address to some extent with approximations, the performances of Lagrangian optimization are, surprisingly, below those obtained with simple, much less computationally intensive penalties [17, 12].
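As a minimal illustration of this rectifier-squared penalty (the function name is ours):

```python
def quadratic_penalty(f_value):
    """ReLU(f)^2 penalty for a constraint f <= 0: zero on the whole feasible set,
    quadratic in the amount of violation otherwise."""
    return max(f_value, 0.0) ** 2
```

Note that the penalty and its gradient vanish on the entire feasible set, including its boundary, which is precisely the "no barrier" behavior discussed above.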
This is, for instance, the case of the recent weakly supervised CNN semantic segmentation results in [12], which showed that a simple quadratic-penalty formulation of inequality constraints substantially outperforms the Lagrangian method in [22]. Also, the authors of [17] reported surprising results in the context of 3D human pose estimation: in their case, replacing the equality constraints with simple quadratic penalties yielded better results than Lagrangian optimization.
We leverage well-established concepts in interior-point methods, which approximate Lagrangian optimization with a sequence of unconstrained problems, while completely avoiding dual steps/projections. Specifically, we propose a sequence of unconstrained log-barrier-extension losses for approximating inequality-constrained CNN problems. The proposed extension has a duality-gap bound, which yields sub-optimality certificates for feasible solutions in the case of convex losses. While sub-optimality is not guaranteed for non-convex problems, this result shows that log-barrier extensions are a principled way to approximate Lagrangian optimization for constrained CNNs. Our approach addresses the well-known limitations of penalty methods and, at the same time, removes the explicit dual steps of Lagrangian optimization. We report comprehensive experiments showing that our formulation outperforms a recent penalty-based constrained CNN method [12], both in terms of accuracy and training stability.
2 Background on Lagrangian optimization and the log-barrier method
This section reviews both standard Lagrangian optimization and the log-barrier method for constrained problems [3]. We also present basic concepts of duality theory, namely the duality gap and ε-suboptimality, which will be needed when introducing our log-barrier extension and the corresponding duality-gap bound. We also discuss the limitations of standard constrained optimization methods in the context of deep CNNs.
Lagrangian optimization: Let us first examine standard Lagrangian optimization for problem (1), whose Lagrangian is:

L(θ, λ) = E(θ) + ∑_{i=1}^{N} λ_i f_i(S)   (2)

where λ = (λ_1, …, λ_N) is the dual variable (or Lagrange-multiplier) vector, with λ_i the multiplier associated with constraint f_i(S) ≤ 0. The dual function is the minimum value of Lagrangian (2) over θ: g(λ) = min_θ L(θ, λ). A dual feasible λ ≥ 0 yields a lower bound on the optimal value of constrained problem (1), which we denote E*: g(λ) ≤ E*. This important inequality can be easily verified, even when problem (1) is not convex; see [3], p. 216. It follows that a dual feasible λ gives a sub-optimality certificate for a given feasible point θ, without knowing the exact value of E*: E(θ) − E* ≤ E(θ) − g(λ). The nonnegative quantity E(θ) − g(λ) is the duality gap for primal-dual pair (θ, λ). If we manage to find a feasible primal-dual pair (θ, λ) such that the duality gap is less than or equal to a certain ε, then the primal feasible θ is ε-suboptimal: a primal feasible point θ is ε-suboptimal when it verifies E(θ) − E* ≤ ε.
This provides a non-heuristic stopping criterion for Lagrangian optimization, which alternates two iterative steps, one primal and one dual, each decreasing the duality gap until a given accuracy ε is attained (strong duality should hold if we want to achieve an arbitrarily small tolerance ε; of course, strong duality does not hold in the case of CNNs, as the primal problem is not convex). In the context of CNNs, the primal step minimizes the Lagrangian w.r.t. θ, which corresponds to training a deep network with stochastic optimization, e.g., SGD: θ ← argmin_θ L(θ, λ). The dual step is a constrained maximization of the dual function via projected gradient ascent: λ ← [λ + η ∇_λ L(θ, λ)]_+ (notice that the dual function is always concave, as it is the pointwise minimum of a family of affine functions, even when the original, or primal, problem is not convex, as is the case for CNNs). As mentioned before, direct use of Lagrangian optimization for deep CNNs increases computational complexity and can lead to instability or poor convergence due to the interplay between stochastic optimization for the primal and the iterates/projections for the dual. Our work approximates Lagrangian optimization with a sequence of unconstrained log-barrier-extension losses, in which the dual variables are implicit, avoiding explicit dual iterates/projections. Let us first review the basic barrier method.
The log-barrier method: The log-barrier method is widely used for inequality-constrained optimization, and belongs to the family of interior-point techniques [3]. To solve our constrained CNN problem (1) with this method, we need to find a strictly feasible set of network parameters as a starting point, which can then be used in an unconstrained problem via the log-barrier function. In the general context of optimization, log-barrier methods proceed in two steps. The first, often called phase I [3], computes a feasible point by Lagrangian minimization of a constrained problem, which in the case of (1) is:

min_{θ, s} s   subject to   f_i(S) ≤ s,  i = 1, …, N   (3)
For deep CNNs with millions of parameters, Lagrangian optimization of problem (3) has the same difficulties as with the initial constrained problem in (1): to find a feasible set of network parameters, one needs to alternate CNN training and projected gradient ascent on the dual variables. This might explain why such interior-point methods, despite their substantial impact in optimization [3], are mostly overlooked in modern deep networks (interior-point methods were investigated for artificial neural networks before the deep-learning era), as is generally the case for other Lagrangian-dual optimization methods.
The second step, often referred to as phase II, approximates (1) as an unconstrained problem:

min_θ E(θ) + ∑_{i=1}^{N} ψ_t(f_i(S))   (4)

where ψ_t is the log-barrier function: ψ_t(z) = −(1/t) log(−z). When t → +∞, this convex, continuous and twice-differentiable function approaches a hard indicator for the constraints: H(z) = 0 if z ≤ 0 and +∞ otherwise; see Fig. 1 (a) for an illustration. The domain of the function is the set of strictly feasible points. The higher t, the better the quality of the approximation. This suggests that a large t yields a good approximation of the initial constrained problem in (1). This is, indeed, confirmed by the following standard duality-gap result for the log-barrier method [3], which shows that optimizing (4) yields a solution that is N/t-suboptimal.

Proposition 1: Let θ* be a minimizer of the unconstrained problem (4) for a given t. Then θ* is N/t-suboptimal for constrained problem (1): E(θ*) − E* ≤ N/t.
Proof: The proof can be found in [3], p. 566. ∎
An important implication that follows immediately from Proposition 1 is that a feasible solution of approximation (4) is N/t-suboptimal: E(θ*) − E* ≤ N/t. This suggests a simple way of solving the initial constrained problem with a guaranteed ε-suboptimality: simply choose a large t and solve unconstrained problem (4). However, for a large t, the log-barrier function is difficult to minimize, because its gradient varies rapidly near the boundary of the feasible set. In practice, log-barrier methods solve a sequence of problems of the form (4) with an increasing value of t. The solution of each problem is used as a starting point for the next, until a specified ε-suboptimality is reached.
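For reference, a minimal sketch of the standard log-barrier ψ_t (the helper name is ours); note that it is simply undefined outside the strictly feasible set, which is exactly the limitation our extension below removes:

```python
import math

def log_barrier(z, t):
    """psi_t(z) = -(1/t) * log(-z), defined only for strictly feasible z < 0.
    It tends to +inf as z -> 0^- (a hard wall), and, for fixed z < 0,
    tends to 0 as t grows."""
    if z >= 0:
        raise ValueError("log-barrier undefined for infeasible points (z >= 0)")
    return -math.log(-z) / t
```

For a fixed strictly feasible z, increasing t weakens the barrier, which is why the approximation of the hard indicator improves as t grows.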
3 Log-barrier extensions
We propose the following unconstrained loss for approximating Lagrangian optimization of constrained problem (1):

min_θ E(θ) + ∑_{i=1}^{N} ψ̃_t(f_i(S))   (5)

where ψ̃_t is our log-barrier extension, which is convex, continuous and differentiable:

ψ̃_t(z) = −(1/t) log(−z)   if z ≤ −1/t²,
ψ̃_t(z) = t z − (1/t) log(1/t²) + 1/t   otherwise.   (6)
Similarly to the standard log-barrier, when t → +∞, our extension (6) can be viewed as a smooth approximation of the hard indicator function H; see Fig. 1 (b). However, a very important difference is that the domain of our extension is not restricted to strictly feasible points. Therefore, our approximation (5) completely removes the requirement for explicit Lagrangian-dual optimization for finding a feasible set of network parameters. In our case, the inequality constraints are fully handled within stochastic optimization, as in standard unconstrained losses, avoiding completely gradient-ascent iterates and projections over explicit dual variables. As we will see in the experiments, our formulation yields better results in terms of accuracy and stability than the recent penalty-based constrained CNN method in [12].
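A minimal numerical sketch of this extension (a hedged reconstruction: the standard barrier for z ≤ −1/t², extended linearly beyond that point with matching value and slope t):

```python
import math

def log_barrier_extension(z, t):
    """Log-barrier extension: standard barrier on z <= -1/t**2, linear beyond.
    Unlike the standard log-barrier, it is finite (and differentiable) for
    every z, including infeasible points z >= 0."""
    threshold = -1.0 / t ** 2
    if z <= threshold:
        return -math.log(-z) / t
    # Linear branch chosen so that value and first derivative (slope t)
    # match those of the log-barrier at the threshold.
    return t * z - math.log(1.0 / t ** 2) / t + 1.0 / t
```

Because the function stays finite at infeasible points, a violated constraint contributes a large but usable gradient instead of an undefined barrier value, which is what makes the loss compatible with plain SGD.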
In our approximation (5), the Lagrangian dual variables for the initial inequality-constrained problem (1) are implicit. We prove the following duality-gap bound, which yields sub-optimality certificates for feasible solutions of our approximation in (5). Our result applies to the general context of convex optimization; in deep CNNs, of course, a feasible solution of our approximation may not be unique, and is not guaranteed to be a global optimum, as E(θ) and the constraints are not convex. It can be viewed as an extension of the standard result in Proposition 1, which expresses the duality gap as a function of t for the log-barrier function.

Proposition 2: Let θ* be a global minimizer of the unconstrained problem (5). If θ* is feasible for constrained problem (1), then it is N/t-suboptimal: E(θ*) − E* ≤ N/t.
Proof: We give a detailed proof in the supplemental material. ∎
From Proposition 2, the following important fact follows immediately: if the solution that we obtain from unconstrained problem (5) is feasible and global, then it is N/t-suboptimal for constrained problem (1): E(θ*) − E* ≤ N/t.
Finally, we arrive at our constrained CNN learning algorithm, which is fully based on SGD. Similarly to the standard log-barrier algorithm, we use a varying parameter t: we optimize a sequence of losses of the form (5), increasing the value of t gradually by a factor μ. The network parameters obtained for the current t and epoch are used as a starting point for the next t and epoch. The steps of the proposed constrained CNN learning algorithm are detailed in Algorithm 1.
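The loop structure of Algorithm 1 can be sketched as follows. All names are illustrative: `model_step` stands for one epoch of SGD on the total loss, `barrier` for a log-barrier-extension term applied to a constraint value, and the defaults for t and μ are only placeholders.

```python
def constrained_training(partial_loss, constraints, barrier, model_step,
                         epochs, t_init=5.0, mu=1.1):
    """Optimize a sequence of losses of the form (5), multiplying t by mu after
    each epoch: constraints are soft at first, then gradually hardened."""
    t = t_init
    for _ in range(epochs):
        # Unconstrained total loss: partial supervised loss + barrier terms.
        total_loss = partial_loss() + sum(barrier(f(), t) for f in constraints)
        model_step(total_loss)  # SGD update(s); no dual iterates or projections
        t *= mu
    return t
```

The key point is that the whole procedure is a sequence of ordinary unconstrained SGD epochs; the constraints only enter through the barrier terms of the loss.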
Both the proposed extended log-barrier and the penalty-based baseline [12] are compatible with any differentiable function f_i, including non-linear and fractional terms (such as Equations (8) and (9) introduced further in the paper). However, we hypothesize that our log-barrier extension better handles the interplay between multiple constraints. To validate this hypothesis, we compare both strategies on the joint optimization of two segmentation constraints, related to region size and centroid.
We define the size of a segmentation for class k as the sum of its predictions over the image domain:

Size_k(S) = ∑_{p ∈ Ω} s_{p,k}   (7)

Notice that we use the softmax predictions to compute Size_k, as using values after thresholding would not be differentiable. In practice, we can make the network predictions near-binary by using a large enough temperature parameter in the softmax. We bound the size function as follows:

a_k ≤ Size_k(S) ≤ b_k   (8)

where a_k and b_k are "individual bounds", i.e., specific bounds determined for each image from its ground truth Y. (Since we focus on methods to constrain the training of a deep neural network, we do not study how such bounds could realistically be obtained without complete annotations; this is left as future work.)
The centroid of the predicted object can be computed as a weighted average of the pixel coordinates:

C_k(S) = (∑_{p ∈ Ω} s_{p,k} c_p) / (∑_{p ∈ Ω} s_{p,k})   (9)

where c_p are the pixel coordinates on a 2D grid. We constrain the position of the centroid to a box around the ground-truth centroid:

l_k ≤ C_k(S) ≤ u_k   (10)

where l_k and u_k are the bound values associated with each image.
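A hedged sketch of this soft centroid and its box constraints (function names and the toy values are ours; each returned constraint value must be ≤ 0 when satisfied):

```python
def soft_centroid(probs, coords, k):
    """Centroid of class k: pixel coordinates weighted by softmax probabilities."""
    total = sum(p[k] for p in probs)
    cx = sum(p[k] * x for p, (x, y) in zip(probs, coords)) / total
    cy = sum(p[k] * y for p, (x, y) in zip(probs, coords)) / total
    return cx, cy

def centroid_box_constraints(probs, coords, k, low, high):
    """Four inequality constraints (each must be <= 0) keeping the centroid of
    class k inside the axis-aligned box [low, high]."""
    cx, cy = soft_centroid(probs, coords, k)
    return [low[0] - cx, cx - high[0], low[1] - cy, cy - high[1]]
```

Like the size term, the centroid is a (fractional) differentiable function of the softmax output, so its constraints can be plugged directly into the barrier or penalty losses.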
4.1 Datasets and evaluation metrics
The proposed loss is evaluated in three different segmentation scenarios using synthetic, medical and color images. Datasets used in each of these problems are detailed below.
Synthetic images: We generated a synthetic dataset composed of 1100 images, each containing two circles of the same size but different colors, with different levels of Gaussian noise added to the whole image. The target object is the darker circle. From these images, 1000 were employed for training and 100 for validation; see Figure 4, first column, for an illustration. No pixel annotation is used during training (Ω_L = ∅). The objective of this simple dataset is to compare our log-barrier extension with the penalty-based approach of [12] in three different constraint settings: 1) only size, 2) only centroid, and 3) both constraints. For the first two settings, we expect both methods to fail, since the corresponding segmentation problems are under-determined (e.g., size alone is not sufficient to determine which circle is the correct one). On the other hand, the third setting provides enough information to segment the right circle, and the main challenge there is the interplay between the two different constraints.
Medical images: We use the PROMISE12 dataset [14], which was made available for the MICCAI 2012 prostate segmentation challenge. Magnetic Resonance (MR) images (T2-weighted) of 50 patients with various diseases were acquired at different locations with several MRI vendors and scanning protocols. We hold out 10 patients for validation and use the rest for training. As in [12], we use a partial cross entropy for the weakly supervised setting, with weak labels derived from the ground truth by placing random dots inside the object of interest (Fig. 2). As this dataset is already fairly centered around the object, we impose constraints only on the size of the object (see Eq. (8)).
Color images: We also evaluate our method on the Semantic Boundaries Dataset (SBD), which can be seen as a scaled-up version of the original PascalVOC segmentation benchmark. We employed the 20 semantic categories defined in PascalVOC. This dataset contains 11318 fully annotated images, divided into 8498 training and 2820 test images. We obtained the scribble annotations from the public ScribbleSup repository [13] and took the intersection of both datasets for our experiments. Thus, a total of 8829 images were used for training, and 1449 for validation.
For the synthetic and PROMISE12 datasets, we resort to the common Dice index, DSC = 2|S ∩ Y| / (|S| + |Y|), to evaluate the performance of the tested methods. For PascalVOC, we follow most studies on this dataset and use the mean Intersection over Union (mIoU) metric.
4.2 Training and implementation details
Since the three datasets have very different characteristics, we considered a specific network architecture and training strategy for each of them.
For the dataset of synthetic images, we used the ENet network [20], as it has shown a good trade-off between accuracy and inference time. The network was trained from scratch using the Adam optimizer and a batch size of 1. The initial learning rate was decreased by half if validation performance did not improve for 20 epochs. The softmax temperature value was set to 5. To segment the prostate, we used the same settings as in [12], reporting their results for the penalty-based baselines. For PascalVOC, we used a PyTorch implementation of the FCN8s model [16], built upon a VGG16 [26] pre-trained from the Torchvision model zoo (https://pytorch.org/docs/stable/torchvision/models.html). We trained this network with a batch size of 1 and a constant learning rate of 10^-5. Regarding the weights of the penalty and log-barrier terms, we investigated several values and obtained the best performances with 10^-4 and 10^-2, respectively.
For all tasks, we set the initial value t of our extended log-barrier to 5 (Algorithm 1), and increased it by a constant factor after each epoch. This strategy relaxes the constraints in the first epochs so that the network can focus on learning from the images, and then gradually hardens these constraints as optimization progresses. Experiments on the toy example and PascalVOC were implemented in Python 3.7 with PyTorch 1.0.1 [21], whereas we followed the same specifications as [12] for the prostate experiments, employing Python 3.6 with PyTorch 1.0. All the experiments were carried out on a server equipped with an NVIDIA Titan V. The code is publicly available at https://github.com/LIVIAETS/extended_logbarrier.
The following sections report the experimental results of the proposed extended log-barrier method on the three datasets introduced in Sec. 4.1.
4.3.1 Synthetic images
Results on the synthetic example for both the penalty-based and the extended log-barrier approaches are reported in Table 1. As expected, constraining the size only is not sufficient to locate the correct circle (2nd and 5th columns in Fig. 4), which explains the very low DSC values in Figure 2(a). However, we observe that the two optimization methods lead to very different solutions: sparse unconnected dots for the penalty method, and a continuous shape for the log-barrier method. This difference could be due to the high gradients of the penalty method in the first iterations, which strongly bias the network toward bright pixels. On the other hand, constraining only the centroid results in a correctly located region, but without the correct boundary (3rd and 6th columns in Figure 4). The most interesting scenario is when both the size and centroid are constrained. In Figure 2(a), we can see that the penalty-constrained network is unstable during training, and has worse results than the log-barrier method (4th and 7th columns). This demonstrates the barrier's effectiveness at preventing predictions from going out of bounds (Fig. 1), thereby making optimization more stable.
Table 1: DSC of each method under the three constraint settings: Size, Centroid, and Size & Centroid.
4.3.2 PROMISE12 dataset
Quantitative results on the prostate segmentation task are reported in Table 2 (left column). If no prior information is imposed, i.e., with only scribbles, the trained model completely fails to achieve a satisfactory performance, with a mean Dice coefficient of 0.032. It can be observed that integrating the target size during training significantly improves performance. While constraining the predicted segmentation with a penalty-based method [12] achieves a DSC value of nearly 0.83, imposing the constraints with our log-barrier extension increases the performance by an additional 2%. The use of the extended log-barrier to constrain the CNN predictions reduces the gap towards the fully supervised model, with only a 4% difference between the two.
In terms of optimization, the extended log-barrier method needs more iterations to converge, eventually surpassing the penalty-based method (Fig. 2(b)). This may be due to the scheduling factor of the extended log-barrier, whose contribution slowly increases over time. This avoids large gradients early in training, which can overshoot the minimum and fail to converge. On the other hand, by the time t becomes large, the constrained function has been slowly pushed toward the desired bounds, leading to small gradient updates. At this point, the barrier prevents the network from predicting segmentations out of bounds, ensuring stability and better performance.
| Method | PROMISE12 (DSC) | VOC2012 (mIoU) |
| Partial cross-entropy | 0.032 (0.015) | 48.48 (14.88) |
| w/ penalty [12] | 0.830 (0.057) | 52.22 (14.94) |
| w/ extended log-barrier | 0.852 (0.038) | 53.40 (14.62) |
| Full supervision | 0.891 (0.032) | 59.87 (16.94) |

Table 2: Mean (standard deviation) on the validation sets of the PROMISE12 and PascalVOC datasets, for networks trained with several levels of supervision.
Table 2 (right column) compares the numerical results of our approach with those of scribble annotations alone and of the naive penalty approach, in the scenario of size constraints. For reference, we also include the results of fully supervised training, which serves as an upper bound. From the results, we can see that the penalty-based method improves the performance over scribble annotations alone by approximately 4% in terms of mIoU. With the proposed extended log-barrier method, the mIoU increases to 53.4%, representing a 1.2% improvement with respect to the penalty-based strategy, and only a 6.4% gap compared to full supervision. Visual results (Fig. 6) show that the proposed framework for constraining the CNN training helps to reduce over-segmentation, reducing the amount of false positives.
We introduced an extended log-barrier to constrain deep CNNs, which avoids the difficult step, in classical interior-point methods, of finding an initial feasible solution. We demonstrated its effectiveness over a penalty-based method [12] on several segmentation tasks and with different constraint settings, including linear and non-linear functions.
In our experiments, we derived size bounds and centroids from segmentation ground truth. Future work could investigate automated techniques to obtain this information directly from input images, for instance using a regression network. Another interesting extension of this work would be to test a broader set of constraints, for example, related to region connectivity or compactness.
We gratefully thank NVIDIA for its GPU donations (TITAN V and GTX 1080 Ti). This work is supported by the Natural Sciences and Engineering Research Council of Canada (NSERC), Discovery Grant program.
A. L. Bearman, O. Russakovsky, V. Ferrari, and F. Li.
What’s the point: Semantic segmentation with point supervision.
European Conference on Computer Vision (ECCV), pages 549–565, 2016.
-  D. P. Bertsekas. Nonlinear Programming. Athena Scientific, Belmont, MA, 1995.
-  S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
-  R. Fletcher. Practical Methods of Optimization. John Wiley & Sons, 1987.
-  P. Gill, W. Murray, and M. Wright. Practical Optimization. Academic Press, 1981.
R. Girshick, J. Donahue, T. Darrell, and J. Malik.
Rich feature hierarchies for accurate object detection and semantic
Conference on computer vision and pattern recognition, pages 580–587, 2014.
-  I. Goodfellow, Y. Bengio, and A. Courville. Deep learning. MIT press, 2016.
-  L. Gorelick, F. R. Schmidt, and Y. Boykov. Fast trust region for segmentation. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 1714–1721, 2013.
M. Hardt, B. Recht, and Y. Singer.
Train faster, generalize better: Stability of stochastic gradient descent.In
International Conference on Machine Learning (ICML), pages 1225–1234, 2016.
F. S. He, Y. Liu, A. G. Schwing, and J. Peng.
Learning to play in a day: Faster deep reinforcement learning by optimality tightening.In International Conference on Learning Representations (ICLR), pages 1–13, 2017.
-  Z. Jia, X. Huang, E. I. Chang, and Y. Xu. Constrained deep weak supervision for histopathology image segmentation. IEEE Transactions on Medical Imaging, 36(11):2376–2388, 2017.
-  H. Kervadec, J. Dolz, M. Tang, E. Granger, Y. Boykov, and I. B. Ayed. Constrained-cnn losses for weakly supervised segmentation. Medical Image Analysis, 2019.
-  D. Lin, J. Dai, J. Jia, K. He, and J. Sun. Scribblesup: Scribble-supervised convolutional networks for semantic segmentation. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 3159–3167, 2016.
-  G. Litjens, R. Toth, W. van de Ven, C. Hoeks, S. Kerkstra, B. van Ginneken, G. Vincent, G. Guillard, N. Birbeck, J. Zhang, et al. Evaluation of prostate segmentation algorithms for mri: the promise12 challenge. Medical Image Analysis, 18(2):359–373, 2014.
-  G. J. S. Litjens, T. Kooi, B. E. Bejnordi, A. A. A. Setio, F. Ciompi, M. Ghafoorian, J. A. W. M. van der Laak, B. van Ginneken, and C. I. Sánchez. A survey on deep learning in medical image analysis. Medical Image Analysis, 42:60–88, 2017.
-  J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 3431–3440, 2015.
-  P. Márquez-Neila, M. Salzmann, and P. Fua. Imposing Hard Constraints on Deep Networks: Promises and Limitations. In CVPR Workshop on Negative Results in Computer Vision, pages 1–9, 2017.
-  M. Niethammer and C. Zach. Segmentation with area constraints. Medical Image Analysis, 17(1):101–112, 2013.
-  G. Papandreou, L. Chen, K. P. Murphy, and A. L. Yuille. Weakly-and semi-supervised learning of a deep convolutional network for semantic image segmentation. In International Conference on Computer Vision (ICCV), pages 1742–1750, 2015.
-  A. Paszke, A. Chaurasia, S. Kim, and E. Culurciello. ENet: A deep neural network architecture for real-time semantic segmentation. arXiv preprint arXiv:1606.02147, 2016.
-  A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in PyTorch. 2017.
-  D. Pathak, P. Krahenbuhl, and T. Darrell. Constrained convolutional neural networks for weakly supervised segmentation. In International Conference on Computer Vision (ICCV), pages 1796–1804, 2015.
-  M. Rajchl, M. C. Lee, O. Oktay, K. Kamnitsas, J. Passerat-Palmbach, W. Bai, M. Damodaram, M. A. Rutherford, J. V. Hajnal, B. Kainz, et al. DeepCut: Object segmentation from bounding box annotations using convolutional neural networks. IEEE Transactions on Medical Imaging, 36(2):674–683, 2017.
-  S. N. Ravi, T. Dinh, V. S. R. Lokhande, and V. Singh. Constrained deep learning using conditional gradient and applications in computer vision. arXiv:1803.0645, 2018.
-  K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In Neural Information Processing Systems (NeurIPS), pages 568–576, 2014.
-  K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations (ICLR), 2015.
-  M. Tang, A. Djelouah, F. Perazzi, Y. Boykov, and C. Schroers. Normalized cut loss for weakly-supervised CNN segmentation. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 1818–1827, 2018.
-  M. Tang, F. Perazzi, A. Djelouah, I. Ben Ayed, C. Schroers, and Y. Boykov. On regularized losses for weakly-supervised CNN segmentation. In European Conference on Computer Vision (ECCV), Part XVI, pages 524–540, 2018.
-  T. B. Trafalis, T. A. Tutunji, and N. P. Couellan. Interior point methods for supervised training of artificial neural networks with bounded weights. In Network Optimization, pages 441–470, 1997.
-  J. Weston, F. Ratle, H. Mobahi, and R. Collobert. Deep learning via semi-supervised embedding. In Neural Networks: Tricks of the Trade, pages 639–655. Springer, 2012.
Proof of Proposition 2
In this section, we provide a detailed proof of the duality-gap bound stated in Proposition 2 of the paper. Recall our unconstrained approximation for the inequality-constrained CNN problem:
$$\min_{\theta} \; E(\theta) + \sum_{i=1}^{m} \tilde{\psi}_t\big(f_i(\theta)\big), \qquad (10)$$
where $\tilde{\psi}_t$ is our log-barrier extension, with $t$ strictly positive:
$$\tilde{\psi}_t(z) = \begin{cases} -\frac{1}{t}\log(-z) & \text{if } z \leq -\frac{1}{t^2}, \\ tz - \frac{1}{t}\log\frac{1}{t^2} + \frac{1}{t} & \text{otherwise.} \end{cases}$$
Let $\theta^*$ be the solution of problem (10) and $\lambda^* = (\lambda_1^*, \dots, \lambda_m^*)$ the corresponding vector of implicit dual variables given by:
$$\lambda_i^* = \begin{cases} -\frac{1}{t f_i(\theta^*)} & \text{if } f_i(\theta^*) \leq -\frac{1}{t^2}, \\ t & \text{otherwise.} \end{cases} \qquad (11)$$
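As a side illustration (not part of the proof), the log-barrier extension and its derivative can be sketched in a few lines of Python, assuming the piecewise form used throughout: the standard log-barrier for $z \leq -1/t^2$ and its first-order linear extension beyond that point; the function names are ours:

```python
import math

def log_barrier_extension(z, t):
    """Log-barrier extension: the standard log-barrier for z <= -1/t^2,
    extended linearly (slope t) beyond, so it is finite for all z."""
    if z <= -1.0 / t ** 2:
        return -math.log(-z) / t
    # Linear extension, chosen so the function is C^1 at z = -1/t^2.
    return t * z - math.log(1.0 / t ** 2) / t + 1.0 / t

def implicit_dual(z, t):
    """Derivative of the extension at z; plays the role of the implicit
    dual variable lambda_i when evaluated at z = f_i(theta)."""
    if z <= -1.0 / t ** 2:
        return -1.0 / (t * z)
    return t
```

Note that `implicit_dual` is strictly positive everywhere, which is what makes the implicit dual variables a feasible dual point, and that `-implicit_dual(z, t) * z <= 1/t` for every `z`, which is the per-constraint inequality exploited in the proof.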
We assume that $\theta^*$ verifies approximately (when optimizing an unconstrained loss via stochastic gradient descent (SGD), there is no guarantee that the obtained solution verifies exactly the optimality conditions) the optimality condition for a minimum of (10):
$$\nabla E(\theta^*) + \sum_{i=1}^{m} \tilde{\psi}_t'\big(f_i(\theta^*)\big) \nabla f_i(\theta^*) \approx 0. \qquad (12)$$
It is easy to verify that each dual variable $\lambda_i^*$ in (11) corresponds to the derivative of the log-barrier extension at $f_i(\theta^*)$: $\lambda_i^* = \tilde{\psi}_t'\big(f_i(\theta^*)\big)$. Therefore, (12) means that $\theta^*$ verifies approximately the optimality condition for the Lagrangian corresponding to the original inequality-constrained problem when $\lambda = \lambda^*$:
$$\nabla E(\theta^*) + \sum_{i=1}^{m} \lambda_i^* \nabla f_i(\theta^*) \approx 0. \qquad (13)$$
It is also easy to check that the implicit dual variables defined in (11) correspond to a feasible dual, i.e., $\lambda^* \geq 0$ element-wise. Therefore, the dual function evaluated at $\lambda^*$ is:
$$g(\lambda^*) = E(\theta^*) + \sum_{i=1}^{m} \lambda_i^* f_i(\theta^*),$$
which yields the duality gap associated with the primal-dual pair $(\theta^*, \lambda^*)$:
$$E(\theta^*) - g(\lambda^*) = -\sum_{i=1}^{m} \lambda_i^* f_i(\theta^*). \qquad (14)$$
Now, to prove that this duality gap is upper-bounded by $\frac{m}{t}$, we consider three cases for each term $\lambda_i^* f_i(\theta^*)$ in the sum in (14) and verify that, in all the cases, we have $\lambda_i^* f_i(\theta^*) \geq -\frac{1}{t}$:

- If $f_i(\theta^*) \leq -\frac{1}{t^2}$, then $\lambda_i^* = -\frac{1}{t f_i(\theta^*)}$ and $\lambda_i^* f_i(\theta^*) = -\frac{1}{t}$.
- If $-\frac{1}{t^2} < f_i(\theta^*) \leq 0$, then $\lambda_i^* = t$ and $\lambda_i^* f_i(\theta^*) = t f_i(\theta^*) > -\frac{1}{t}$.
- If $f_i(\theta^*) > 0$, then $\lambda_i^* = t$ and $\lambda_i^* f_i(\theta^*) > 0 > -\frac{1}{t}$.

In all three cases, we have $\lambda_i^* f_i(\theta^*) \geq -\frac{1}{t}$. Summing this inequality over $i$ gives $\sum_{i=1}^{m} \lambda_i^* f_i(\theta^*) \geq -\frac{m}{t}$. Using this inequality in (14) yields the following upper bound on the duality gap associated with primal $\theta^*$ and implicit dual feasible $\lambda^*$ for the original inequality-constrained problem:
$$E(\theta^*) - g(\lambda^*) \leq \frac{m}{t}. \qquad ∎$$
This bound yields sub-optimality certificates for feasible solutions of our approximation in (10). If the solution $\theta^*$ that we obtain from our unconstrained problem (10) is feasible, i.e., it satisfies the constraints $f_i(\theta^*) \leq 0$, $i = 1, \dots, m$, then $\theta^*$ is $\frac{m}{t}$-suboptimal for the original inequality-constrained problem: $E(\theta^*) - E^* \leq \frac{m}{t}$, where $E^*$ denotes the optimal value of the constrained problem. Our upper-bound result can be viewed as an extension of the duality-gap equality $\frac{m}{t}$ for the standard log-barrier function. Our result applies to the general context of convex optimization. In deep CNNs, of course, a feasible solution for our approximation may not be unique and is not guaranteed to be a global optimum, as $E$ and the constraints $f_i$ are not convex.
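As a sanity check (ours, not from the paper), the bound can be verified numerically on a 1-D convex toy problem: minimize $E(x) = (x-2)^2$ subject to the single constraint $f(x) = x \leq 0$ ($m = 1$), for which the dual function has the closed form $g(\lambda) = 2\lambda - \lambda^2/4$. All names below are ours:

```python
import math

T = 10.0  # barrier parameter t

def psi_ext(z, t=T):
    """Log-barrier extension (assumed piecewise form)."""
    if z <= -1.0 / t ** 2:
        return -math.log(-z) / t
    return t * z - math.log(1.0 / t ** 2) / t + 1.0 / t

def psi_ext_prime(z, t=T):
    """Its derivative, i.e. the implicit dual variable."""
    if z <= -1.0 / t ** 2:
        return -1.0 / (t * z)
    return t

# Toy convex problem: minimize E(x) = (x - 2)^2  subject to  f(x) = x <= 0.
E = lambda x: (x - 2.0) ** 2
f = lambda x: x

# Solve the unconstrained approximation min_x E(x) + psi_ext(f(x)) by grid search
# (the objective is convex, so a fine grid suffices for this check).
xs = [-0.5 + i * 1e-5 for i in range(100001)]
x_star = min(xs, key=lambda x: E(x) + psi_ext(f(x)))

# Implicit dual variable and duality gap, using the closed-form dual
# function of the toy problem: g(lam) = 2*lam - lam^2/4.
lam_star = psi_ext_prime(f(x_star))
gap = E(x_star) - (2.0 * lam_star - lam_star ** 2 / 4.0)

print(x_star, lam_star, gap)  # the gap comes out at m/t = 1/t = 0.1
```

Here the grid minimizer lands in the log-barrier region ($f(x^*) \leq -1/t^2$), so the duality gap attains the bound $m/t$ exactly, matching the first case of the proof.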
Training curves for the three datasets
Figure 7 shows the learning curves on the training set.
In Figure 7(b), the behaviour of the network trained with the extended log-barrier is particularly striking: its training Dice is initially low, but it gradually reaches a considerably higher value than its penalty-based counterpart.
Visual results on PROMISE12 validation set
Figure 8 shows more results on the validation set.
Visual results on PascalVOC validation set
Figure 9 shows more results on the PascalVOC validation set.