1 Introduction
In recent years, image semantic segmentation has received considerable attention in the medical imaging research community, and almost all contemporary methods rely on deep learning and fully convolutional neural networks
[Ronneberger et al.(2015)Ronneberger, Fischer, and Brox, Litjens et al.(2017)Litjens, Kooi, Bejnordi, Setio, Ciompi, Ghafoorian, Van Der Laak, Van Ginneken, and Sánchez]. While network architectures have been extensively studied [Ronneberger et al.(2015)Ronneberger, Fischer, and Brox, Milletari et al.(2016)Milletari, Navab, and Ahmadi, Chen et al.(2017)Chen, Papandreou, Kokkinos, Murphy, and Yuille], the loss functions used to train them have received comparatively less attention, and most existing methods rely on variants of either the cross-entropy
[Ronneberger et al.(2015)Ronneberger, Fischer, and Brox] or Dice loss [Sudre et al.(2017)Sudre, Li, Vercauteren, Ourselin, and Cardoso, Milletari et al.(2016)Milletari, Navab, and Ahmadi]. Ultimately, all those losses perform "pixel-wise classification" and do not account for the image spatial domain—for instance, standard implementations in popular frameworks completely discard the image dimensions (https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html). Other methods take into account the distances to the segmentation boundary [Kervadec et al.(2019a)Kervadec, Bouchtiba, Desrosiers, Granger, Dolz, and Ayed], or—in a weakly supervised segmentation setting—have access to only partial/uncertain annotations [Qu et al.(2019)Qu, Wu, Huang, Yi, Riedlinger, De, and Metaxas, Rajchl et al.(2016)Rajchl, Lee, Oktay, Kamnitsas, Passerat-Palmbach, Bai, Damodaram, Rutherford, Hajnal, Kainz, et al., Papandreou et al.(2015)Papandreou, Chen, Murphy, and Yuille, Bearman et al.(2016)Bearman, Russakovsky, Ferrari, and Fei-Fei, Lin et al.(2016)Lin, Dai, Jia, He, and Sun], but they eventually supervise a subset of pixels individually. Informally, we could say that existing segmentation methods are "micromanaging" pixels, treating each one as a separate classification problem, instead of supervising the global shape of the segmentation prediction. (Some additional related methods are discussed in Appendix A.) The traditional computer vision literature abounds with global mathematical descriptions that characterize the shapes of objects
[Nayak and Stojmenovic(2008)], for instance, shape moments, length, total variation, Fourier transforms, etc. It has also been shown that descriptions based on a few geometric shape moments could be enough to reconstruct complex shapes
[Milanfar et al.(2000)Milanfar, Putinar, Varah, Gustafsson, and Golub], via solving an inverse problem. Furthermore, such geometric shape moments can be made invariant with respect to geometric transformations (e.g., rotation, translation, and scaling) by purely mathematical manipulations, which is convenient for segmentation [Klodt and Cremers(2011)]. This includes the well-known Hu’s invariant moments [Hu(1962)]. While less popular in computer vision today than they used to be, those remain powerful regularization and shape-description tools for segmentation methods. So powerful, perhaps, that they could be used on their own to characterize the objects that we want to segment, while providing intrinsic invariance; in short, supervising the overall shape prediction of a segmentation network, not through individual pixels but rather through global shape descriptions. This paper studies how effective global geometric shape descriptors can be when used on their own as segmentation losses for training deep neural networks. Beyond the theoretical interest, there are deeper motivations for posing segmentation problems as a reconstruction of shape descriptors. First, annotations that yield approximations of low-order shape moments could be much less cumbersome to obtain than their full-mask counterparts (e.g., from a few mouse clicks by the user). Furthermore, anatomical priors can be readily translated into shape descriptions, which is not feasible with dense label masks. This might alleviate the annotation burden for training deep segmentation networks. Also, some shape descriptors could readily "encode" biomarkers, leading to better interpretability. Finally, and most importantly, we hypothesize that, for a given task, certain shape descriptions might be invariant across image acquisition protocols/modalities and subject populations, which might open interesting research avenues for generalization in segmentation.
Our contributions can be summarized as follows:

we reintroduce and reformulate different shape descriptors, in the context of deep semantic segmentation;

inspired by recent works in inequality constraints, we propose a way to use those descriptors to supervise deep neural networks;

as such, we benchmark a combination of those descriptors and show that—surprisingly—using only a few shape descriptors can go a long way, even in more complex settings (Figure 1). In fact, we found that as few as 4 descriptor values per class can approach the performance of a segmentation mask with 65k individual discrete labels;

we discuss future research directions that could benefit from those surprising findings.

2 Formulation
2.1 Notation and background
Let $\Omega \subset \mathbb{R}^2$ denote the image spatial domain (for readability and simplicity, we detail here only the case of 2D images, but the method can be extended to $d$ dimensions in a straightforward way) and $x : \Omega \rightarrow \mathbb{R}$ an input image, with $y$ its associated ground truth. Here, $\Delta^{K-1}$ refers to the probability simplex, and $y(\omega) \in \{0,1\}^K$ to its vertices, i.e., a one-hot encoding for $K$ classes. Our goal is to train a network parameterized by $\theta$, predicting a dense probability map $s_\theta : \Omega \rightarrow \Delta^{K-1}$; $s_\theta^{(k)}(\omega)$ denotes the predicted softmax probability for class $k$ at pixel $\omega$. For each pixel $\omega \in \Omega$, its coordinates in the 2D space are represented by the tuple $(\omega_1, \omega_2)$. Shape and central moments have been widely studied in traditional computer vision [Nayak and Stojmenovic(2008), Milanfar et al.(2000)Milanfar, Putinar, Varah, Gustafsson, and Golub], where they can be used to characterize a shape. Each moment is parametrized by its orders $(p, q)$, and each order represents a different characteristic of the shape.
Shape moment
Shape moments can be defined, in their general form, as functions of the deep-network softmax predictions $s_\theta^{(k)}$ for a given class $k$ as follows:
$$\mathcal{M}^{(k)}_{p,q}(s_\theta) = \sum_{\omega \in \Omega} s_\theta^{(k)}(\omega)\, \omega_1^p\, \omega_2^q,$$
where $p, q \in \mathbb{N}$ are the moment orders.
Central moment
The central moment is closely related to the shape moment, the difference being that the coordinates $\omega_1$ and $\omega_2$ are shifted by their respective centroids, for translation invariance (more details in the next subsection). It is given by:
$$\bar{\mathcal{M}}^{(k)}_{p,q}(s_\theta) = \sum_{\omega \in \Omega} s_\theta^{(k)}(\omega)\, (\omega_1 - \bar{\omega}_1)^p (\omega_2 - \bar{\omega}_2)^q,$$
where $(\bar{\omega}_1, \bar{\omega}_2)$ is the centroid of class $k$.
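As an illustration, the two moment formulas above can be written in a few lines of NumPy, treating the (H, W) softmax map of one class as weights over the pixel-coordinate grid (a minimal sketch under our own naming, not the authors' implementation):

```python
import numpy as np

def shape_moment(s_k: np.ndarray, p: int, q: int) -> float:
    """Shape moment M_{p,q} of an (H, W) soft prediction map for one class."""
    i, j = np.meshgrid(*map(np.arange, s_k.shape), indexing="ij")
    return float((s_k * i**p * j**q).sum())

def central_moment(s_k: np.ndarray, p: int, q: int) -> float:
    """Central moment: the same weighted sum, with the pixel coordinates
    shifted by the centroid of the class first."""
    volume = shape_moment(s_k, 0, 0)
    i_bar = shape_moment(s_k, 1, 0) / volume
    j_bar = shape_moment(s_k, 0, 1) / volume
    i, j = np.meshgrid(*map(np.arange, s_k.shape), indexing="ij")
    return float((s_k * (i - i_bar)**p * (j - j_bar)**q).sum())
```

On a binary mask, the zeroth moment is the object's area, and the first central moments vanish by construction, which is a quick sanity check for any implementation.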
Image Laplacian
The Laplacian $\mathsf{L}$ of an image is defined by the underlying graph structure $\mathcal{G} = (\Omega, \mathcal{E})$, which describes the connectivity between pairs of pixels. A sparse graph (i.e., each pixel is connected only to its 4 or 8 direct neighbors) is often used and can be encoded with a sparse adjacency matrix $A$, where $A_{mn} = 1$ means that pixels $m$ and $n$ are neighbors, and $A_{mn} = 0$ means that they are not. The Laplacian is constructed directly from $A$:
$$\mathsf{L} := \operatorname{diag}(A\mathbb{1}) - A,$$
where $\operatorname{diag}$ builds a diagonal matrix and $A\mathbb{1}$ encodes the number of neighbors of each pixel. For an 8-neighbor connectivity, this number is the same for all pixels, except at the image edges, which have fewer. Notice that $\mathsf{L}$ depends only on the image spatial domain, not on the image values. As such, it can be efficiently precomputed and cached for all the samples in a dataset—assuming they share the same resolution. There exist edge-sensitive variants, which define $A$ so that it accounts for pixel similarities (e.g., intensity differences), but these are beyond the scope of this study.
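A 4-neighbor grid Laplacian of this form can be assembled from two 1D path-graph Laplacians with a Kronecker sum; a sketch using SciPy sparse matrices (our own construction, nothing here is specific to the authors' code):

```python
import numpy as np
from scipy import sparse

def path_laplacian(n: int) -> sparse.csr_matrix:
    """Laplacian of a path graph on n >= 2 vertices: diag(degrees) - adjacency."""
    degrees = np.full(n, 2.0)
    degrees[[0, -1]] = 1.0  # endpoints have a single neighbor
    off = -np.ones(n - 1)
    return sparse.diags([off, degrees, off], [-1, 0, 1]).tocsr()

def grid_laplacian(h: int, w: int) -> sparse.csr_matrix:
    """4-neighbor Laplacian of an (h, w) image grid, as the Kronecker sum of
    row and column path Laplacians. Depends only on the image shape."""
    return (sparse.kron(path_laplacian(h), sparse.eye(w))
            + sparse.kron(sparse.eye(h), path_laplacian(w))).tocsr()
```

Row sums of a graph Laplacian are zero, and the diagonal holds each pixel's number of neighbors (2 at a corner, 4 in the interior), which makes the construction easy to verify.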
2.2 Shape descriptors
With these standard building blocks, it is possible to define shape descriptors that measure actual properties of the object, rather than listing the pixels that should belong to it. All the following descriptors hold for some input image and some class $k$:
Volume
The volume of the predicted segmentation is simply the summation of the predicted probabilities—a special case of shape moments. As such:
$$V^{(k)}(s_\theta) = \sum_{\omega \in \Omega} s_\theta^{(k)}(\omega) = \mathcal{M}^{(k)}_{0,0}(s_\theta).$$
Centroid
The centroid of a class can be computed by dividing the first-order shape moments by the volume. It can be seen as the average of the pixel coordinates for class $k$:
$$C^{(k)}(s_\theta) = \left(\frac{\mathcal{M}^{(k)}_{1,0}(s_\theta)}{V^{(k)}(s_\theta)},\ \frac{\mathcal{M}^{(k)}_{0,1}(s_\theta)}{V^{(k)}(s_\theta)}\right).$$
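Concretely, on an (H, W) softmax map for one class, the volume and centroid reduce to a weighted sum and a weighted average (a hedged sketch; the function names are ours):

```python
import numpy as np

def volume(s_k: np.ndarray) -> float:
    """V^{(k)}: the sum of the predicted probabilities (moment M_{0,0})."""
    return float(s_k.sum())

def centroid(s_k: np.ndarray) -> tuple[float, float]:
    """C^{(k)}: first-order moments divided by the volume, i.e. the
    probability-weighted average of the pixel coordinates."""
    i, j = np.meshgrid(*map(np.arange, s_k.shape), indexing="ij")
    v = volume(s_k)
    return float((s_k * i).sum() / v), float((s_k * j).sum() / v)
```

Both functions are differentiable in the probabilities, which is what allows them to serve as training losses when applied to softmax outputs.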
Average distance to the centroid
It measures how far the object spreads around its centroid, on average. It is the standard deviation of the pixel coordinates for class $k$:
$$D^{(k)}(s_\theta) = \left(\sqrt{\frac{\bar{\mathcal{M}}^{(k)}_{2,0}(s_\theta)}{V^{(k)}(s_\theta)}},\ \sqrt{\frac{\bar{\mathcal{M}}^{(k)}_{0,2}(s_\theta)}{V^{(k)}(s_\theta)}}\right).$$
Length
The length of a segmentation, or rather, the length of its boundary, can be efficiently computed by reusing the precomputed image Laplacian. Briefly, each classification difference between two neighbors counts as 1, while neighbors with the same predicted class count as 0; this is a standard Potts model. It is trivial to relax this definition to plug in the predicted (continuous) probabilities:
$$L^{(k)}(s_\theta) = s_\theta^{(k)\top} \mathsf{L}\, s_\theta^{(k)},$$
where $s_\theta^{(k)}$ is viewed as a flattened vector of probabilities.
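On a binary prediction, this quadratic form with the 4-neighbor Laplacian counts exactly the edges cut by the segmentation boundary, which is the Potts view described above. A small self-contained sketch (our own construction of the Laplacian, not the authors' code):

```python
import numpy as np
from scipy import sparse

def grid_laplacian(h: int, w: int) -> sparse.csr_matrix:
    """4-neighbor grid Laplacian as a Kronecker sum of path-graph Laplacians."""
    def path(n):
        deg = np.full(n, 2.0)
        deg[[0, -1]] = 1.0
        return sparse.diags([-np.ones(n - 1), deg, -np.ones(n - 1)], [-1, 0, 1])
    return (sparse.kron(path(h), sparse.eye(w))
            + sparse.kron(sparse.eye(h), path(w))).tocsr()

def length(s_k: np.ndarray) -> float:
    """L^{(k)} = s^T L s: a soft count of neighbor pairs with differing labels."""
    s = s_k.ravel()
    return float(s @ (grid_laplacian(*s_k.shape) @ s))
```

For a hard 2x2 square inside a 4x4 image, every one of the 8 edges leaving the square contributes 1 to the quadratic form, so the computed length is its discrete perimeter.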
Ratio of descriptors
In the multi-class setting, some relationships between different classes might be known in advance, using anatomical priors, for instance. While exact values are not necessarily required, inequalities can provide useful information. As such, we can define an additional descriptor for a pair of classes $k$ and $l$, for a specific descriptor $f$:
$$R_f^{(k,l)}(s_\theta) := \frac{f^{(k)}(s_\theta)}{f^{(l)}(s_\theta)}.$$
2.3 Supervision with constraints
Instead of optimizing a pixel-wise loss, we design loss functions that penalize the deviations between the global shape descriptors computed from the predicted segmentation and those corresponding to the ground truth, e.g., $\tau_V^{(k)} := V^{(k)}(y)$ for the volume. This could be formulated as a hard equality-constrained optimization problem.
Here, we propose to relax the constraints by adding a lower and an upper bound centered around the ground-truth value (this may mimic imprecise information about shape descriptors when these are derived, for instance, from anatomical prior knowledge rather than from a ground truth):
$$\begin{aligned}
\min_\theta \quad & \mathcal{L}(\theta)\\
\text{subject to} \quad & 0.9\,\tau_f^{(k)} \leq f^{(k)}(s_\theta) \leq 1.1\,\tau_f^{(k)} \quad \forall k,\ \forall f \in \{V, C, D, L\}\\
& a \leq R_f^{(k,l)}(s_\theta) \leq b \quad \text{for some } f, a, b, k, l,
\end{aligned}$$
where $\tau_f^{(k)} := f^{(k)}(y)$ denotes the descriptor value computed on the ground truth $y$.
In the context of deep neural networks, standard constrained-optimization techniques (such as Lagrangian or interior-point methods) are not directly applicable for tractability reasons. The inequality constraints can be tackled directly as a loss function using a log-barrier-extension penalty (details can be found in Appendix B), controlled by a parameter $t$ that is increased over time to make the bounds tighter and tighter. Such log-barrier penalties were introduced recently in [Kervadec et al.(2019c)Kervadec, Dolz, Yuan, Desrosiers, Granger, and Ayed] in the general optimization context for constrained deep networks. As, for the sake of the study, we want to completely forego pixel-wise supervision, we set $\mathcal{L}(\theta) = 0$. As such, our final model is:
$$\min_\theta \; \sum_{k} \sum_{f \in \{V, C, D, L\}} \tilde\psi_t\!\left(0.9\,\tau_f^{(k)} - f^{(k)}(s_\theta)\right) + \tilde\psi_t\!\left(f^{(k)}(s_\theta) - 1.1\,\tau_f^{(k)}\right), \qquad (1)$$
where $\tilde\psi_t$ is the log-barrier extension detailed in Appendix B.
Bounds for $R_f^{(k,l)}$ can be included in the same fashion, if available and relevant—depending on the task at hand. The bound values $a$ and $b$ for $R_f^{(k,l)}$ do not rely on $y$, but rather on expert knowledge about the task. We give some examples in the next section.
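The resulting loss can be sketched in a few lines, with the barrier penalty written out explicitly (a simplified scalar version; the dict interface and the descriptor names are our own, not the authors' API):

```python
import math

def barrier(z: float, t: float) -> float:
    """Extended log-barrier penalty for the constraint z <= 0 (see Appendix B)."""
    if z <= -1.0 / t**2:
        return -math.log(-z) / t
    return t * z - math.log(1.0 / t**2) / t + 1.0 / t

def descriptor_loss(pred: dict[str, float], target: dict[str, float],
                    t: float = 5.0) -> float:
    """Keep every predicted descriptor within [0.9, 1.1] times its
    ground-truth value; there is no pixel-wise term at all."""
    return sum(barrier(0.9 * target[f] - pred[f], t)
               + barrier(pred[f] - 1.1 * target[f], t)
               for f in pred)
```

Inside the relaxed interval the penalty is small (and negative, as log-barriers are), while a descriptor leaving the interval is penalized with slope $t$, so raising $t$ over training progressively tightens the bounds.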
3 Experiments
3.1 Datasets
Heart segmentation on cine-MRI
The main dataset that we use in our experiments is the publicly available 2017 ACDC Challenge [Bernard et al.(2018)Bernard, Lalande, Zotti, Cervenansky, Yang, Heng, Cetin, Lekadir, Camara, Ballester, et al.], which contains 4 classes to segment: left and right ventricles, myocardium, and background. The dataset consists of 100 cine magnetic resonance (MR) exams covering well-defined pathologies: dilated cardiomyopathy, hypertrophic cardiomyopathy, myocardial infarction with altered left ventricular ejection fraction, and abnormal right ventricle. It also includes normal subjects. We chose this dataset because it is a good benchmark for shape descriptors: not only is it a multi-class setting, but the myocardium and left ventricle share a common centroid, and the myocardium completely surrounds the left ventricle—which is more challenging to describe. We constrain the volume $V^{(k)}$, centroid $C^{(k)}$, average distance to the centroid $D^{(k)}$, and length $L^{(k)}$ of each class. Moreover, the relationship between myocardium and left ventricle can be formulated with bounds on their relative length $R_L^{(\text{Myo},\text{LV})}$. We retained 70 exams for training, 10 for validation, and 20 for testing.
Prostate segmentation on T2-weighted MR
The second dataset that we use is the Promise12 challenge [Litjens et al.(2014)Litjens, Toth, van de Ven, Hoeks, Kerkstra, van Ginneken, Vincent, Guillard, Birbeck, Zhang, et al.]. It contains the transversal T2-weighted MR images of 50 patients, acquired at different centers with multiple MRI vendors and different scanning protocols. The images include patients with benign diseases, as well as with prostate cancer. We employed 35 patients for training, 5 for validation, and 10 for testing. The difficulty of this dataset lies in its low contrast and the very variable shape of the prostate. We supervise the prostate class with the shape descriptors of Section 2.2.
3.2 Implementation details
We use the ENet architecture [Paszke et al.(2016)Paszke, Chaurasia, Kim, and Culurciello] for the experiments on ACDC, and a modified fully residual U-Net for the experiment on Promise12—the prostate is a harder task that requires a more powerful network, and this also enables us to validate the supervision method on a different network architecture. We perform blurring, shifting, and scaling as online data augmentations, and we use the same network initialization for all settings, with the same scheduler and hyperparameters (Adam optimizer [Kingma and Ba(2014)], with a learning rate of 5e-4). The shape descriptors are computed from the annotated masks, and we relax them by ±10% to obtain the lower and upper bounds. Most of the implementation was done in the PyTorch framework, and experiments were run on an Nvidia Titan RTX. All descriptors can be efficiently vectorized, resulting in minimal slowdown during training (less than 10% compared to a cross-entropy loss). The computation of the Laplacian $\mathsf{L}$ is done once per image shape (usually a single one per dataset after preprocessing), and cached using standard Python utilities (lru_cache from functools).
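That caching strategy can be reproduced with a couple of lines; since the Laplacian depends only on the image shape, memoizing on (h, w) is enough (a sketch; the dense matrix here is only for brevity, the paper caches a sparse one):

```python
from functools import lru_cache

import numpy as np

@lru_cache(maxsize=None)
def cached_laplacian(h: int, w: int) -> np.ndarray:
    """Dense 4-neighbor grid Laplacian, built once per image shape."""
    n = h * w
    idx = np.arange(n).reshape(h, w)
    adj = np.zeros((n, n))
    adj[idx[:, :-1].ravel(), idx[:, 1:].ravel()] = 1.0  # horizontal edges
    adj[idx[:-1, :].ravel(), idx[1:, :].ravel()] = 1.0  # vertical edges
    adj += adj.T  # make the adjacency symmetric
    return np.diag(adj.sum(axis=1)) - adj
```

Subsequent calls with the same shape return the very same array object, so the O(n²) construction cost is paid once per dataset rather than once per batch.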
Our code is publicly available at https://github.com/hkervadec/shape_descriptors, and can easily be extended to other shape descriptors.
4 Results
Surprisingly, using only a few shape descriptors in place of dense pixel-wise supervision is enough to segment the objects of interest, as we can see in Figure 2. On ACDC, what remains the most difficult to learn is the hierarchy between the left ventricle and its surrounding myocardium: some noisy myocardium pixels can sometimes remain inside the predicted left ventricle. Nonetheless, we can consider that the network has properly learned the overall structure of the heart. On Promise12, the task is difficult even for cross-entropy with full annotations. Despite the more powerful network used, the low contrast can still trick both methods. Nevertheless, supervision with shape descriptors is capable of predicting a rough location and shape of the prostate, which is much more than we initially expected. Actual testing DSC values can be found in Table 1, and the plots of training and validation metrics over time can be found in Appendix C.
Table 1: Testing DSC (standard deviation in parentheses). RV, Myo, LV, and Overall are on ACDC; Prostate is on Promise12.

Method | RV | Myo | LV | Overall | Prostate
Cross-entropy (pixel-wise) | 0.879 (0.066) | 0.829 (0.074) | 0.919 (0.059) | 0.876 (0.076) | 0.871 (0.047)
Ours (shape descriptors) | 0.825 (0.107) | 0.660 (0.114) | 0.819 (0.086) | 0.768 (0.128) | 0.651 (0.098)
5 Discussion and conclusion
We have shown that simple and light shape descriptors can be effective supervision tools for semantic segmentation, allowing us to completely avoid pixel-wise supervision and proving how powerful shape descriptors can be. In the multi-class setting, the neural network is able to learn the inherent relationships between classes and the anatomical structure of the heart.
While not needed on the two datasets that we benchmarked on, it is very easy to compute the orientation and elongation of an object [Nayak and Stojmenovic(2008)], which would be very useful for certain tasks (for instance, esophagus segmentation). Spatial relationships between classes, which would be translation invariant, could be very beneficial in some settings, such as the co-segmentation of the esophagus and trachea—both long objects next to each other.
We found empirically that using only shape descriptors without online data augmentation was more sensitive to network initialization than its pixel-wise counterpart. It is entirely plausible that the random network initializations, designed and tuned with cross-entropy in mind [Sutskever et al.(2013)Sutskever, Martens, Dahl, and Hinton], are not optimal for shape descriptors. As such, future work could investigate other network initialization strategies.
A main limitation of the method is its inability to be sub-patched and processed in different batches (any loss requiring a sum over an area bigger than the current patch shares this limitation, including the very popular Dice loss and its derivatives). Recently, for a similarly ill-suited problem (enforcing a prior on the distribution of the classes over the whole training set), [Zhou et al.(2019)Zhou, Li, Bai, Wang, Chen, Han, Fishman, and Yuille] showed that a primal-dual approach can be a promising avenue.
We believe that we have barely scratched the surface of the potential of invariant shape descriptors: shape and central moment orders can go much higher than two. Depending on the task, some invariant and higher-order descriptors could be common to all the samples, and would not require additional annotations but rather exploit existing anatomical knowledge. Time series could also benefit, as some descriptors might not vary across time, reducing the annotation burden and enabling the reuse of previously computed statistics. We describe such a setting in Appendix D. All in all, this might open interesting avenues for generalization across subject populations and acquisition protocols.
References
 [Bearman et al.(2016)Bearman, Russakovsky, Ferrari, and Fei-Fei] Amy Bearman, Olga Russakovsky, Vittorio Ferrari, and Li Fei-Fei. What’s the point: Semantic segmentation with point supervision. In European conference on computer vision, pages 549–565. Springer, 2016.
 [Bernard et al.(2018)Bernard, Lalande, Zotti, Cervenansky, Yang, Heng, Cetin, Lekadir, Camara, Ballester, et al.] Olivier Bernard, Alain Lalande, Clement Zotti, Frederick Cervenansky, Xin Yang, Pheng-Ann Heng, Irem Cetin, Karim Lekadir, Oscar Camara, Miguel Angel Gonzalez Ballester, et al. Deep learning techniques for automatic MRI cardiac multi-structures segmentation and diagnosis: is the problem solved? IEEE transactions on medical imaging, 37(11):2514–2525, 2018.
 [Chen et al.(2017)Chen, Papandreou, Kokkinos, Murphy, and Yuille] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE transactions on pattern analysis and machine intelligence, 40(4):834–848, 2017.

 [Hu(1962)] Ming-Kuei Hu. Visual pattern recognition by moment invariants. IRE transactions on information theory, 8(2):179–187, 1962.
 [Kervadec et al.(2019a)Kervadec, Bouchtiba, Desrosiers, Granger, Dolz, and Ayed] Hoel Kervadec, Jihene Bouchtiba, Christian Desrosiers, Eric Granger, Jose Dolz, and Ismail Ben Ayed. Boundary loss for highly unbalanced segmentation. In International conference on medical imaging with deep learning, pages 285–296. PMLR, 2019a.
 [Kervadec et al.(2019b)Kervadec, Dolz, Tang, Granger, Boykov, and Ayed] Hoel Kervadec, Jose Dolz, Meng Tang, Eric Granger, Yuri Boykov, and Ismail Ben Ayed. Constrained-CNN losses for weakly supervised segmentation. Medical image analysis, 54:88–99, 2019b.
 [Kervadec et al.(2019c)Kervadec, Dolz, Yuan, Desrosiers, Granger, and Ayed] Hoel Kervadec, Jose Dolz, Jing Yuan, Christian Desrosiers, Eric Granger, and Ismail Ben Ayed. Constrained deep networks: Lagrangian optimization via log-barrier extensions. arXiv preprint arXiv:1904.04205, 2019c.
 [Kingma and Ba(2014)] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 [Klodt and Cremers(2011)] Maria Klodt and Daniel Cremers. A convex framework for image segmentation with moment constraints. In 2011 International Conference on Computer Vision, pages 2236–2243. IEEE, 2011.
 [Lin et al.(2016)Lin, Dai, Jia, He, and Sun] Di Lin, Jifeng Dai, Jiaya Jia, Kaiming He, and Jian Sun. ScribbleSup: Scribble-supervised convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3159–3167, 2016.
 [Litjens et al.(2014)Litjens, Toth, van de Ven, Hoeks, Kerkstra, van Ginneken, Vincent, Guillard, Birbeck, Zhang, et al.] Geert Litjens, Robert Toth, Wendy van de Ven, Caroline Hoeks, Sjoerd Kerkstra, Bram van Ginneken, Graham Vincent, Gwenael Guillard, Neil Birbeck, Jindang Zhang, et al. Evaluation of prostate segmentation algorithms for MRI: the PROMISE12 challenge. Medical image analysis, 18(2):359–373, 2014.
 [Litjens et al.(2017)Litjens, Kooi, Bejnordi, Setio, Ciompi, Ghafoorian, Van Der Laak, Van Ginneken, and Sánchez] Geert Litjens, Thijs Kooi, Babak Ehteshami Bejnordi, Arnaud Arindra Adiyoso Setio, Francesco Ciompi, Mohsen Ghafoorian, Jeroen Awm Van Der Laak, Bram Van Ginneken, and Clara I Sánchez. A survey on deep learning in medical image analysis. Medical image analysis, 42:60–88, 2017.
 [Milanfar et al.(2000)Milanfar, Putinar, Varah, Gustafsson, and Golub] Peyman Milanfar, Mihai Putinar, James Varah, Bjoern Gustafsson, and Gene H Golub. Shape reconstruction from moments: theory, algorithms, and applications. In Advanced Signal Processing Algorithms, Architectures, and Implementations X, volume 4116, pages 406–416. International Society for Optics and Photonics, 2000.
 [Milletari et al.(2016)Milletari, Navab, and Ahmadi] Fausto Milletari, Nassir Navab, and Seyed-Ahmad Ahmadi. V-Net: Fully convolutional neural networks for volumetric medical image segmentation. In 2016 fourth international conference on 3D vision (3DV), pages 565–571. IEEE, 2016.
 [Milletari et al.(2017)Milletari, Rothberg, Jia, and Sofka] Fausto Milletari, Alex Rothberg, Jimmy Jia, and Michal Sofka. Integrating statistical prior knowledge into convolutional neural networks. In International Conference on Medical Image Computing and ComputerAssisted Intervention, pages 161–168. Springer, 2017.
 [Nayak and Stojmenovic(2008)] Amiya Nayak and Ivan Stojmenovic. 2d shape measures for computer vision. 2008.
 [Nguyen and Ray(2020)] Nhat M Nguyen and Nilanjan Ray. Endtoend learning of convolutional neural net and dynamic programming for left ventricle segmentation. In Medical Imaging with Deep Learning, pages 555–569. PMLR, 2020.
 [Oktay et al.(2017)Oktay, Ferrante, Kamnitsas, Heinrich, Bai, Caballero, Cook, De Marvao, Dawes, O'Regan, et al.] Ozan Oktay, Enzo Ferrante, Konstantinos Kamnitsas, Mattias Heinrich, Wenjia Bai, Jose Caballero, Stuart A Cook, Antonio De Marvao, Timothy Dawes, Declan P O'Regan, et al. Anatomically constrained neural networks (ACNNs): application to cardiac image enhancement and segmentation. IEEE transactions on medical imaging, 37(2):384–395, 2017.

 [Papandreou et al.(2015)Papandreou, Chen, Murphy, and Yuille] George Papandreou, Liang-Chieh Chen, Kevin P Murphy, and Alan L Yuille. Weakly- and semi-supervised learning of a deep convolutional network for semantic image segmentation. In Proceedings of the IEEE international conference on computer vision, pages 1742–1750, 2015.
 [Paszke et al.(2016)Paszke, Chaurasia, Kim, and Culurciello] Adam Paszke, Abhishek Chaurasia, Sangpil Kim, and Eugenio Culurciello. ENet: A deep neural network architecture for real-time semantic segmentation. arXiv preprint arXiv:1606.02147, 2016.
 [Qu et al.(2019)Qu, Wu, Huang, Yi, Riedlinger, De, and Metaxas] Hui Qu, Pengxiang Wu, Qiaoying Huang, Jingru Yi, Gregory M Riedlinger, Subhajyoti De, and Dimitris N Metaxas. Weakly supervised deep nuclei segmentation using points annotation in histopathology images. In International Conference on Medical Imaging with Deep Learning, pages 390–400. PMLR, 2019.
 [Rajchl et al.(2016)Rajchl, Lee, Oktay, Kamnitsas, Passerat-Palmbach, Bai, Damodaram, Rutherford, Hajnal, Kainz, et al.] Martin Rajchl, Matthew CH Lee, Ozan Oktay, Konstantinos Kamnitsas, Jonathan Passerat-Palmbach, Wenjia Bai, Mellisa Damodaram, Mary A Rutherford, Joseph V Hajnal, Bernhard Kainz, et al. DeepCut: Object segmentation from bounding box annotations using convolutional neural networks. IEEE transactions on medical imaging, 36(2):674–683, 2016.
 [Ray et al.(2012)Ray, Acton, and Zhang] Nilanjan Ray, Scott T Acton, and Hong Zhang. Seeing through clutter: Snake computation with dynamic programming for particle segmentation. In Proceedings of the 21st International Conference on Pattern Recognition (ICPR2012), pages 801–804. IEEE, 2012.
 [Ronneberger et al.(2015)Ronneberger, Fischer, and Brox] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pages 234–241. Springer, 2015.
 [Sudre et al.(2017)Sudre, Li, Vercauteren, Ourselin, and Cardoso] Carole H Sudre, Wenqi Li, Tom Vercauteren, Sebastien Ourselin, and M Jorge Cardoso. Generalised dice overlap as a deep learning loss function for highly unbalanced segmentations. In Deep learning in medical image analysis and multimodal learning for clinical decision support, pages 240–248. Springer, 2017.

 [Sutskever et al.(2013)Sutskever, Martens, Dahl, and Hinton] Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. In International conference on machine learning, pages 1139–1147. PMLR, 2013.
 [Zhou et al.(2019)Zhou, Li, Bai, Wang, Chen, Han, Fishman, and Yuille] Yuyin Zhou, Zhe Li, Song Bai, Chong Wang, Xinlei Chen, Mei Han, Elliot Fishman, and Alan L Yuille. Prior-aware neural network for partially-supervised multi-organ segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10672–10681, 2019.
Appendix A Related works
While there is, to the best of our knowledge, no other work attempting to supervise the shape of the predicted segmentation the way we do—with direct losses that can be plugged on top of any existing network—there exist a few works that create custom architectures to regularize the shape of the predicted segmentation. They all require full-mask annotations, use a base cross-entropy loss, and have been evaluated only on simpler binary settings.
ACNN [Oktay et al.(2017)Oktay, Ferrante, Kamnitsas, Heinrich, Bai, Caballero, Cook, De Marvao, Dawes, O'Regan, et al.] trains an autoencoder with example annotations to generate a shape embedding of the task at hand. The encoder is then used when training the main segmentation network, by minimizing (on top of the base cross-entropy loss) the Euclidean distance between the encoding of the predicted segmentation and the encoding of the (fully labeled) ground truth.
[Milletari et al.(2017)Milletari, Rothberg, Jia, and Sofka] integrate PCA values (computed from an annotated dataset) into a dedicated “PCA-aware” CNN architecture, which is used to regularize the shape of the predicted segmentation. However, the heavily customized layers make it incompatible with existing FCN architectures, and the precomputed PCA values are hardly interpretable.
[Nguyen and Ray(2020)] integrate a “star-shape” prior into their training, on top of the base cross-entropy loss. This was used to regularize the contour of ventricle segmentations in 2D slices, given a user-provided centroid. The original prior [Ray et al.(2012)Ray, Acton, and Zhang] was optimized using dynamic programming, which is not usable for deep neural networks. In that paper, the authors proved that it is possible to “learn” that dynamic-programming part using an additional, trainable network module. However, to be useful, the method requires a user-provided centroid even at inference.
Appendix B Extended log-barrier
The extended log-barrier was introduced in [Kervadec et al.(2019c)Kervadec, Dolz, Yuan, Desrosiers, Granger, and Ayed], as standard Lagrangian or interior-point methods are not directly applicable to deep learning settings.
If we take a simple constrained optimization setting:
$$\min_\theta \quad \mathcal{L}(\theta) \quad \text{subject to} \quad z \leq 0,$$
then its extended log-barrier equivalent is:
$$\min_\theta \quad \mathcal{L}(\theta) + \tilde\psi_t(z),$$
with
$$\tilde\psi_t(z) = \begin{cases} -\frac{1}{t} \log(-z) & \text{if } z \leq -\frac{1}{t^2} \\ tz - \frac{1}{t} \log\!\left(\frac{1}{t^2}\right) + \frac{1}{t} & \text{otherwise,} \end{cases}$$
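The two branches above meet smoothly at $z = -1/t^2$; a direct transcription makes this easy to verify numerically (the function name is ours):

```python
import math

def psi_t(z: float, t: float) -> float:
    """Extended log-barrier: the strict log-barrier below -1/t^2, and its
    linear extension (matching value and slope) above that threshold."""
    if z <= -1.0 / t**2:
        return -math.log(-z) / t
    return t * z - math.log(1.0 / t**2) / t + 1.0 / t
```

At the junction, both branches evaluate to $-\log(1/t^2)/t$ and both have slope $t$, so the penalty is continuously differentiable; increasing $t$ steepens the linear part, which is what tightens the constraint over training.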
where $t$ is the slope parameter of the log-barrier, which is increased over time, eventually “closing” the barrier as $t \rightarrow \infty$. This is illustrated in Figure 3.
The advantages of the log-barrier are twofold:

it allows gradually increasing the tightness of the constraints that we want to satisfy;

once the constraints are satisfied, it gently pushes the constrained function back toward the feasible set, preventing it from going out of bounds.
Appendix C Training curves
The training curves show that training is fairly stable over time, though in the case of Promise12 it takes a few epochs for the network to start producing meaningful predictions. This is related, we think, to the random initialization procedures used in standard deep learning settings, which might not be optimal when using different forms of supervision.
Appendix D Time-independent 3D shape descriptors
As mentioned in Section 2.1, all shape descriptors can be extended quite easily to 3D. In the case of the ACDC dataset, this could be very powerful: the training data comes from actual 4D cine-MRI scans (3D images over time), with annotations at two time points, at the beginning of the systole and diastole, when the heart is at its biggest and smallest, respectively.
By computing the descriptors at those two extremes, one could get patient-wise (and not image-wise) upper and lower bounds for our descriptors, valid at any time point. Figure 5 shows one ellipsoid per class, based on the average distance to the centroid $D^{(k)}$ and centered around the centroid $C^{(k)}$. The thick lines represent the shift of the centroid between the two phases.
We can clearly see the “shrink” between the two phases, and the slight shift of the centroids toward the center of the scan. The bounds given by those two annotations would allow constraining the remaining unannotated 3D volumes at training time, increasing the training set size by an order of magnitude. The benefits are clear, especially in low-data settings.