1 Introduction
The Active Contour Model (ACM) [12] is one of the most influential computer vision techniques. It has been successfully employed in various image analysis tasks, including object segmentation and tracking. In most ACM variants, the deformable curve(s) of interest dynamically evolve through an iterative procedure that minimizes a corresponding energy functional. Since the ACM is a model-based formulation founded on geometric and physical principles, the segmentation process relies mainly on the content of the image itself, not on large annotated image datasets, extensive computational resources, or hours or days of training. However, the classic ACM relies on some degree of user interaction to specify the initial contour and to tune the parameters of the energy functional, which undermines its applicability to the automated analysis of large quantities of images.
In recent years, Deep Neural Networks (DNNs) have become popular in many areas. In computer vision and medical image analysis, Convolutional Neural Networks (CNNs) have been successfully exploited for different segmentation tasks [6, 9, 17]. Despite their tremendous success, the performance of CNNs remains highly dependent on their training datasets. In essence, CNNs rely on a filter-based learning scheme in which the weights of the network are usually tuned via gradient descent on a backpropagated error. Since CNN architectures often include millions of trainable parameters, the training process relies on the sheer size of the dataset. In addition, CNNs usually generalize poorly to images that differ from those in their training datasets, and they are vulnerable to adversarial examples [23]. For image segmentation, capturing the details of object boundaries and delineating them remains challenging even for the most promising CNN architectures that have achieved state-of-the-art performance on relevant benchmark datasets [4, 10, 24]. The recently proposed DeepLabv3+ [5] has mitigated this problem to some extent by leveraging the power of dilated convolutions, but such improvements were made possible by extensive pre-training and vast computational resources; 50 GPUs were reportedly used to train this model.
In this paper, we aim to bridge the gap between CNNs and ACMs by introducing a truly end-to-end framework. Our framework leverages an automatically differentiable ACM with trainable parameters that allows for the backpropagation of gradients. This ACM can be trained along with a backbone CNN from scratch, without any pre-training. Moreover, our ACM utilizes a locally-penalized energy functional that is directly predicted by its backbone CNN in the form of 2D feature maps, and it is initialized directly by the CNN. Thus, our work alleviates one of the biggest obstacles to exploiting the power of ACMs by eliminating the need for any type of user supervision or intervention.
As a challenging test case for our Deep Convolutional Active Contours (DCAC) framework, we tackle the problem of building instance segmentation in aerial images. Our DCAC sets new state-of-the-art benchmarks on the Vaihingen and Bing Huts building instance segmentation datasets, outperforming its closest competitor by a wide margin.
2 Related Work
Eulerian active contours:
Eulerian active contours evolve the segmentation curve by dynamically propagating an implicit function so as to minimize its associated energy functional [18]. The most notable approaches that utilize this formulation are the active contours without edges of Chan and Vese [3] and the geodesic active contours of Caselles et al. [2]. The Caselles-Kimmel-Sapiro model depends mainly on the location of the level set, whereas the Chan-Vese model relies mainly on the content difference between the interior and exterior of the level set. In addition, Lankton and Tannenbaum [14] propose a reformulation of the Chan-Vese model in which the energy functional incorporates image properties in local regions around the level set, which was shown to segment objects with heterogeneous features more accurately.
“End-to-End” CNNs with ACMs:
Several efforts have attempted to integrate CNNs with ACMs in an end-to-end manner, as opposed to utilizing the ACM merely as a post-processor of the CNN output. Le et al. [15] implemented level-set ACMs as Recurrent Neural Networks (RNNs) for the task of semantic segmentation of natural images. There are three key differences between our proposed DCAC and this effort: (1) DCAC does not reformulate ACMs as RNNs and, as a result, is more computationally efficient. (2) DCAC benefits from a novel locally-penalized energy functional, whereas [15] has constant weighted parameters. (3) DCAC has an entirely different pipeline: we employ a single CNN that is trained from scratch along with the ACM, whereas [15] requires two pre-trained CNN backbones (one for object localization, the other for classification). The dependence of [15] on pre-trained CNNs limits its applicability. The other attempt, the DSAC model of Marcos et al. [16], integrates ACMs with CNNs in a structured prediction framework for building instance segmentation in aerial images. There are three key differences between DCAC and this work: (1) [16] depends heavily on the manual initialization of contours, whereas our DCAC is fully automated and runs without any external supervision. (2) The ACM used in [16] has a parametric formulation that can handle only a single building at a time, whereas our DCAC leverages the Eulerian ACM, which can naturally handle multiple building instances simultaneously. (3) [16] requires the user to explicitly calculate the gradients, whereas our approach fully automates the direct backpropagation of gradients through the entire DCAC framework due to its automatically differentiable ACM.

Building instance segmentation:
Modern CNN-based methods have been applied to the problem of building segmentation in several ways. Some efforts have treated it as a semantic segmentation problem [20, 22] and utilized post-processing steps to extract the building boundaries. Other efforts have utilized instance segmentation networks [11] to directly predict the locations of buildings.
3 Level Set Active Contours
First proposed by Osher and Sethian [19] to evolve wavefronts in CFD simulations, a level set is an implicit representation of a hypersurface that is dynamically evolved according to the nonlinear Hamilton-Jacobi equation. In 2D, let $C(t)$ be a closed time-varying contour represented in $\Omega \subset \mathbb{R}^2$ by the zero level set of the signed distance map $\phi(x,y,t)$. The function $\phi$ evolves according to

$$\frac{\partial \phi}{\partial t} + F\,|\nabla \phi| = 0, \qquad \phi(x,y,0) = \phi_0(x,y),$$ (1)

where $F$ is the speed of the evolution in the direction normal to the contour and $\phi_0(x,y)$ represents the initial level set.
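As a toy numerical illustration of this evolution (a minimal sketch in plain Python, not the paper's TensorFlow implementation; the grid size, time step, and constant speed below are our arbitrary choices), the level set can be advanced with explicit Euler steps and central-difference gradients:

```python
import math

def grad_mag(phi, h=1.0):
    """Central-difference gradient magnitude of a 2D scalar field (interior only)."""
    n, m = len(phi), len(phi[0])
    g = [[0.0] * m for _ in range(n)]
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            dx = (phi[i][j + 1] - phi[i][j - 1]) / (2 * h)
            dy = (phi[i + 1][j] - phi[i - 1][j]) / (2 * h)
            g[i][j] = math.hypot(dx, dy)
    return g

def evolve(phi, speed, dt=0.1, steps=10):
    """Explicit Euler integration of d(phi)/dt = -speed * |grad phi|, as in (1)."""
    for _ in range(steps):
        g = grad_mag(phi)
        phi = [[phi[i][j] - dt * speed * g[i][j] for j in range(len(phi[0]))]
               for i in range(len(phi))]
    return phi

# phi0: signed distance (negative inside) to a circle of radius 3 on a 15x15 grid.
n = 15
phi0 = [[math.hypot(i - n // 2, j - n // 2) - 3.0 for j in range(n)] for i in range(n)]
phi = evolve(phi0, speed=1.0, dt=0.2, steps=5)
```

With a positive outward speed, the enclosed region (where $\phi < 0$) grows, i.e., the zero level set expands.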
We introduce a generalization of the level-set ACM proposed by Chan and Vese [3]. Their model assumes that the image $I(x,y)$ of interest consists of two areas of distinct intensities. The interior of $C$ is represented by the smoothed Heaviside function

$$H(\phi) = \frac{1}{2}\left(1 + \frac{2}{\pi}\arctan\!\left(\frac{\phi}{\epsilon}\right)\right),$$ (2)

and $1 - H(\phi)$ represents its exterior. The derivative of (2) is the smoothed Dirac delta function

$$\delta(\phi) = \frac{\partial H(\phi)}{\partial \phi} = \frac{1}{\pi}\,\frac{\epsilon}{\epsilon^2 + \phi^2}.$$ (3)
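These two functions transcribe directly to code and can be checked numerically; the sketch below assumes the standard arctan-based smoothing of Chan and Vese [3], with the smoothing width $\epsilon$ set to 1 purely for illustration:

```python
import math

def heaviside(phi, eps=1.0):
    # Smoothed Heaviside: H(phi) = (1/2) * (1 + (2/pi) * atan(phi / eps))
    return 0.5 * (1.0 + (2.0 / math.pi) * math.atan(phi / eps))

def dirac(phi, eps=1.0):
    # Its derivative, the smoothed Dirac delta: (1/pi) * eps / (eps^2 + phi^2)
    return (eps / math.pi) / (eps ** 2 + phi ** 2)
```

A finite-difference derivative of `heaviside` matches `dirac`, confirming that (3) is indeed the derivative of (2).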
The energy functional associated with $C$ is written as

$$E(\phi) = \int_\Omega \Big[ \mu\,\delta(\phi)\,|\nabla\phi| + \nu\,H(\phi) + \lambda_1(x,y)\,(I - m_1)^2\,H(\phi) + \lambda_2(x,y)\,(I - m_2)^2\,\big(1 - H(\phi)\big) \Big]\,dx\,dy,$$ (4)

where $\mu$ penalizes the length of $C$ and $\nu$ penalizes its enclosed area (we set $\mu$ and $\nu$ to fixed values), and where $m_1$ and $m_2$ are the mean image intensities inside and outside $C$. We follow Lankton and Tannenbaum [14] and define $m_1$ and $m_2$ as the mean image intensities inside and outside $C$ within a local window around each point of $C$.
Note that to afford greater control over $C$, we have generalized the constants $\lambda_1$ and $\lambda_2$ used in [3] to parameter functions $\lambda_1(x,y)$ and $\lambda_2(x,y)$ in (4). The contour locally expands or shrinks depending on the relative magnitudes of $\lambda_1(x,y)$ and $\lambda_2(x,y)$ [7]. In DCAC, these parameter functions are trainable and learned directly by the backbone CNN. Fig. 2 illustrates an example of these maps as learned by the CNN.
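To make the role of the per-pixel parameter maps concrete, here is a hedged sketch of the region terms of the energy functional on a toy two-region image. For brevity, global means stand in for the local-window means, the contour-length and area terms are omitted, and we adopt the convention that the interior of the contour is where $\phi > 0$; none of these choices are claimed to match the paper's implementation:

```python
import math

def heaviside(phi, eps=1.0):
    return 0.5 * (1.0 + (2.0 / math.pi) * math.atan(phi / eps))

def region_energy(img, phi, lam1, lam2, eps=1.0):
    """Locally weighted Chan-Vese region terms; lam1, lam2 are per-pixel maps."""
    n, m = len(img), len(img[0])
    ins = [(i, j) for i in range(n) for j in range(m) if phi[i][j] > 0]
    out = [(i, j) for i in range(n) for j in range(m) if phi[i][j] <= 0]
    m1 = sum(img[i][j] for i, j in ins) / max(len(ins), 1)  # mean intensity inside
    m2 = sum(img[i][j] for i, j in out) / max(len(out), 1)  # mean intensity outside
    e = 0.0
    for i in range(n):
        for j in range(m):
            h = heaviside(phi[i][j], eps)  # ~1 inside the contour, ~0 outside
            e += lam1[i][j] * (img[i][j] - m1) ** 2 * h
            e += lam2[i][j] * (img[i][j] - m2) ** 2 * (1.0 - h)
    return e

# Toy two-region image: left half intensity 0, right half intensity 1.
img = [[0.0] * 3 + [1.0] * 3 for _ in range(6)]
ones = [[1.0] * 6 for _ in range(6)]                             # uniform lambda maps
phi_good = [[1.0 if j >= 3 else -1.0 for j in range(6)] for _ in range(6)]
phi_bad = [[1.0] * 6 for _ in range(6)]                          # everything "inside"
e_good = region_energy(img, phi_good, ones, ones)
e_bad = region_energy(img, phi_bad, ones, ones)
```

As expected, the contour that separates the two intensity regions attains a lower region energy than the degenerate one.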
4 CNN Backbone
As our CNN backbone, we follow [8] and utilize a fully convolutional encoder-decoder architecture with dilated residual blocks (Fig. 3). Each convolutional layer is followed by a Rectified Linear Unit (ReLU) activation and batch normalization. The dilated residual block consists of two consecutive dilated convolutional layers whose outputs are fused with its input and fed into the ReLU activation layer. In the encoder, each path consists of two consecutive convolutional layers, followed by a dilated residual unit with a dilation rate of 2. Before being fed into the dilated residual unit, the output of these convolutional layers is added to the output feature maps of another two consecutive convolutional layers that learn additional multiscale information from the input image resized to that resolution. To recover the content lost in the learned feature maps during the encoding process, we utilize a series of consecutive dilated residual blocks with dilation rates of 1, 2, and 4 and feed the output to a dilated spatial pyramid pooling layer with four different dilation rates of 1, 6, 12, and 18. The decoder is connected to the dilated residual units at each resolution via skip connections, and in each path we upsample the image and employ two consecutive convolutional layers before proceeding to the next resolution. The output of the decoder is fed into another series of two consecutive convolutional layers and then passed into three separate convolutional layers that predict the output maps of $\lambda_1$ and $\lambda_2$ as well as the distance transform.

5 DCAC Architecture and Implementation
In our DCAC framework (Fig. 1), the CNN backbone serves to directly initialize the zero level-set contour as well as the weighted local parameters. We initialize the zero level set with a learned distance transform that is directly predicted by the CNN, alongside additional convolutional layers that learn the parameter maps. Figure 2 illustrates an example of what the backbone CNN learns in the DCAC for one input image from the Vaihingen dataset. These learned parameters are then passed to the ACM, which unfolds for a certain number of timesteps in a differentiable manner. The final zero level set is then converted to logits and compared with the label, and the resulting error is backpropagated through the entire framework in order to tune the weights of the CNN backbone. Algorithm 1 presents the details of the DCAC training algorithm.

5.1 Implementation Details
All components of DCAC, including the ACM, have been implemented entirely in TensorFlow [1] and are compatible with both TensorFlow 1.x and 2.0. The ACM implementation benefits from TensorFlow's automatic differentiation utility and has been designed to enable the backpropagation of the error gradient through the layers of the ACM.
In each ACM layer, each point along the zero level-set contour is probed by a local window, and the mean intensities of the inside and outside regions, i.e., $m_1$ and $m_2$ in (4), are extracted. In our implementation, $m_1$ and $m_2$ are extracted using a differentiable global average pooling layer with appropriate padding so as not to lose any information at the edges.
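The local-window extraction of $m_1$ and $m_2$ can be mimicked in plain Python. This is a sketch under our own assumptions: a square $(2r+1)\times(2r+1)$ window, border clamping in place of the paper's padding scheme, and an interior-where-$\phi>0$ convention; the real implementation uses a differentiable pooling layer in TensorFlow:

```python
def local_means(img, phi, r=1):
    """Local interior/exterior mean intensities m1, m2 in a (2r+1)^2 window.
    Border indices are clamped so edge pixels keep full window support."""
    n, m = len(img), len(img[0])
    m1 = [[0.0] * m for _ in range(n)]
    m2 = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            s1 = c1 = s2 = c2 = 0.0
            for di in range(-r, r + 1):
                for dj in range(-r, r + 1):
                    ii = min(max(i + di, 0), n - 1)  # clamp at the image border
                    jj = min(max(j + dj, 0), m - 1)
                    if phi[ii][jj] > 0:              # interior of the contour
                        s1 += img[ii][jj]; c1 += 1
                    else:                            # exterior of the contour
                        s2 += img[ii][jj]; c2 += 1
            m1[i][j] = s1 / c1 if c1 else 0.0
            m2[i][j] = s2 / c2 if c2 else 0.0
    return m1, m2

# Toy image whose right half (intensity 1) is the interior region.
img = [[0.0, 0.0, 1.0, 1.0] for _ in range(4)]
phi = [[-1.0, -1.0, 1.0, 1.0] for _ in range(4)]
m1, m2 = local_means(img, phi, r=1)
```

Near the contour, each pixel sees both region means, and border pixels still receive a full window thanks to the clamping.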
All training was performed on an Nvidia Titan Xp GPU and an Intel® Core™ i7-7700K CPU @ 4.20 GHz. The minibatch sizes for training on the Vaihingen and Bing Huts datasets were 3 and 20, respectively. All training sessions employed the Adam optimization algorithm [13] with a learning rate of 0.001 that decays by a factor of 10 every 10 epochs.
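The stated schedule amounts to a simple step decay; as a sketch (the function and parameter names are ours, not from the paper's code):

```python
def lr_at_epoch(epoch, base_lr=1e-3, drop=10.0, every=10):
    # Divide the base learning rate by `drop` once every `every` epochs.
    return base_lr / (drop ** (epoch // every))
```

For example, the rate stays at 1e-3 through epoch 9, then drops to 1e-4 at epoch 10, 1e-5 at epoch 20, and so on.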
Table 1: Building instance segmentation results on the Vaihingen and Bing Huts datasets.

Model                      | Vaihingen                    | Bing Huts
                           | Dice   mIoU   WCov   BoundF  | Dice   mIoU   WCov   BoundF
DSAC                       | –      0.840  –      –       | –      0.650  –      –
U-Net                      | 0.810  0.797  0.843  0.622   | 0.710  0.740  0.852  0.421
ResNet                     | 0.801  0.791  0.841  0.770   | 0.810  0.797  0.864  0.434
Backbone CNN               | 0.837  0.825  0.865  0.680   | 0.737  0.764  0.809  0.431
DCAC: Single Inst          | 0.928  0.929  0.943  0.819   | 0.855  0.860  0.894  0.534
DCAC: Multi Inst           | 0.908  0.893  0.910  0.797   | 0.797  0.809  0.839  0.491
DCAC: Single Inst, Const   | 0.877  0.888  0.936  0.801   | 0.792  0.813  0.889  0.513
DCAC: Multi Inst, Const    | 0.857  0.842  0.876  0.707   | 0.757  0.777  0.891  0.486
6 Experiments
6.1 Datasets
Vaihingen:
The Vaihingen buildings dataset (http://www2.isprs.org/commissions/comm3/wg4/2dsemlabelvaihingen.html) consists of 168 building images. The labels for each image were generated using a semi-automated approach. We used 100 images for training and 68 for testing, following the same data split as in [16]. In this dataset, almost all images contain multiple building instances, some of which are located at the edges of the image.
Bing Huts:
The Bing Huts dataset (https://www.openstreetmap.org/#map=4/38.00/95.80) consists of 605 images. We followed the same data split used in [16]: 335 images for training and 270 for testing. This dataset is especially challenging due to the low spatial resolution and low contrast exhibited in the images.
6.2 Evaluation Metrics and Loss Function
To evaluate our model's performance, we utilized five different metrics: Dice, mean Intersection over Union (mIoU), Weighted Coverage (WCov), Boundary F-score (BoundF), and Root Mean Square Error (RMSE). The original DSAC paper reported only mIoU for both Vaihingen and Bing Huts, and RMSE only for the Bing Huts dataset. However, since the delineation of boundaries is one of the important goals of our framework, we employ the BoundF metric [21] to precisely measure the similarity between the boundary pixels in our predictions and those in the corresponding image labels. Furthermore, we used a soft Dice loss function to train our model.
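A common form of the soft Dice loss over flattened probability maps is sketched below; the smoothing constant and the exact formulation are our assumptions, as the paper does not specify which soft Dice variant it uses:

```python
def soft_dice_loss(pred, target, smooth=1.0):
    """1 - soft Dice coefficient; `smooth` guards against empty masks."""
    inter = sum(p * t for p, t in zip(pred, target))
    denom = sum(pred) + sum(target)
    return 1.0 - (2.0 * inter + smooth) / (denom + smooth)
```

The loss is 0 for a perfect prediction and grows toward 1 as the predicted and reference masks stop overlapping.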
7 Results and Discussion
7.1 Local and Fixed Weighted Parameters
To validate the contribution of the local weighted parameters in the level-set ACM, we also trained our DCAC on both the Vaihingen and Bing Huts datasets while allowing only one trainable scalar parameter for each of $\lambda_1$ and $\lambda_2$, constant over the entire image. As presented in Table 1, on both datasets this constant formulation still outperforms the baseline CNN in all evaluation metrics for both single-instance and multi-instance buildings, demonstrating the effectiveness of the end-to-end training of the DCAC. However, the DCAC with the full $\lambda_1(x,y)$ and $\lambda_2(x,y)$ maps outperforms the constant formulation by a wide margin in all experiments and metrics. A key metric of interest in this comparison is the BoundF score, which demonstrates how our local formulation captures the details of the boundaries more effectively by locally adjusting the inward and outward forces on the contour. As illustrated in Figure 4, DCAC accurately delineates the boundaries of the building instances, whereas DCAC with the constant formulation over-segments them.
7.2 Buildings on the Edges of the Image
Our DCAC is capable of properly segmenting building instances located at the edges of some of the images in the Vaihingen dataset. This is mainly due to the padding scheme we utilize in the global average pooling layer that extracts the local pixel intensities, which avoids the loss of information at the image boundaries.
7.3 Initialization and Number of ACM Iterations
In all cases, we performed our experiments with the goal of leveraging the CNN to fully automate the ACM and eliminate the need for any human supervision. Our scheme of directly learning a generalized distance transform helped us localize all building instances simultaneously and initialize the zero level sets appropriately, while avoiding a computationally expensive and non-differentiable distance transform operation. In addition, initializing the zero level sets in this manner, instead of the common practice of initializing from a circle, helped the contour converge significantly faster and avoid undesirable local minima.
7.4 Comparison Against the DSAC Model
Although most of the images in the Vaihingen dataset contain multiple building instances, the DSAC model [16] can deal with only a single building at a time. For a fair comparison between the two approaches, we report separate metrics for a single building, as reported in [16] for the DSAC, as well as for all building instances (which the DSAC cannot handle). As presented in Table 1, our DCAC outperforms DSAC in mIoU on both the Vaihingen and Bing Huts datasets (0.929 vs. 0.840 and 0.860 vs. 0.650, respectively). Furthermore, the multiple-instance metrics of our DCAC surpass even the single-instance DSAC results. As demonstrated in Fig. 5, on the Vaihingen dataset DSAC struggles to cope with the topological changes of the buildings and fails to appropriately capture sharp edges, while our framework handles these challenges in most cases. On the Bing Huts dataset, DSAC is able to localize the buildings but over-segments them in many cases, possibly because the low contrast and small size of the images prevent it from distinguishing the buildings from the surrounding soil. By comparison, our DCAC handles this low-contrast dataset well and produces more accurate boundaries, as can be seen by comparing the segmentation outputs of DSAC and DCAC in Fig. 5.
8 Conclusions and Future Work
We have introduced DCAC, a novel image segmentation framework that is a truly end-to-end integration of ACMs and CNNs. We proposed a novel locally-penalized Eulerian energy model with pixel-wise learnable parameters that can adjust the contour to precisely capture and delineate the boundaries of objects of interest in the image. As a test case, we tackled the problem of building instance segmentation on two challenging datasets, Vaihingen and Bing Huts, and our model outperforms the current state-of-the-art method, DSAC. Unlike DSAC, which relies on the manual initialization of its ACM contour, our model requires minimal human supervision and is initialized and guided by its CNN backbone. Moreover, DSAC can segment only a single building at a time, whereas our DCAC can segment multiple buildings simultaneously. We also showed that, unlike DSAC, our DCAC is effective in handling various topological changes in the image. Given the level of success that DCAC has achieved in this application and the fact that it features a general Eulerian formulation, it is readily applicable to other segmentation tasks in various domains where purely filter-based CNN approaches can benefit from the versatility and precision of ACMs in delineating object boundaries.
References

[1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. TensorFlow: A system for large-scale machine learning. In OSDI, volume 16, pages 265–283, 2016.
[2] V. Caselles, R. Kimmel, and G. Sapiro. Geodesic active contours. International Journal of Computer Vision, 22(1):61–79, 1997.
[3] T. F. Chan and L. A. Vese. Active contours without edges. IEEE Transactions on Image Processing, 10(2):266–277, 2001.
[4] L.-C. Chen, G. Papandreou, F. Schroff, and H. Adam. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587, 2017.
[5] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. arXiv preprint arXiv:1802.02611, 2018.
[6] A. Hatamizadeh, S. P. Ananth, X. Ding, D. Terzopoulos, N. Tajbakhsh, et al. Automatic segmentation of pulmonary lobes using a progressive dense V-network. In Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support, pages 282–290. Springer, 2018.
[7] A. Hatamizadeh, A. Hoogi, D. Sengupta, W. Lu, B. Wilcox, D. Rubin, and D. Terzopoulos. Deep active lesion segmentation. arXiv preprint arXiv:1908.06933, 2019.
[8] A. Hatamizadeh, H. Hosseini, Z. Liu, S. D. Schwartz, and D. Terzopoulos. Deep dilated convolutional nets for the automatic segmentation of retinal vessels. arXiv preprint arXiv:1905.12120, 2019.
[9] A. Hatamizadeh, D. Terzopoulos, and A. Myronenko. End-to-end boundary aware networks for medical image segmentation. arXiv preprint arXiv:1908.08071, 2019.
[10] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 2980–2988. IEEE, 2017.

[11]
V. Iglovikov, S. Seferbekov, A. Buslaev, and A. Shvets.
Ternausnetv2: Fully convolutional network for instance segmentation.
In
The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops
, June 2018.  [12] M. Kass, A. Witkin, and D. Terzopoulos. Snakes: Active contour models. International Journal of Computer Vision, 1(4):321–331, 1988.
[13] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[14] S. Lankton and A. Tannenbaum. Localizing region-based active contours. IEEE Transactions on Image Processing, 17(11):2029–2039, 2008.
[15] T. H. N. Le, K. G. Quach, K. Luu, C. N. Duong, and M. Savvides. Reformulating level sets as deep recurrent neural network approach to semantic segmentation. IEEE Transactions on Image Processing, 27(5):2393–2407, 2018.
[16] D. Marcos, D. Tuia, B. Kellenberger, L. Zhang, M. Bai, R. Liao, and R. Urtasun. Learning deep structured active contours end-to-end. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 8877–8885, 2018.
[17] A. Myronenko and A. Hatamizadeh. 3D kidneys and kidney tumor semantic segmentation using boundary-aware networks. arXiv preprint arXiv:1909.06684, 2019.
[18] S. Osher and R. P. Fedkiw. Level set methods: An overview and some recent results. Journal of Computational Physics, 169(2):463–502, 2001.
[19] S. Osher and J. A. Sethian. Fronts propagating with curvature-dependent speed: Algorithms based on Hamilton-Jacobi formulations. Journal of Computational Physics, 79(1):12–49, 1988.
[20] S. Paisitkriangkrai, J. Sherrah, P. Janney, A. van den Hengel, et al. Effective semantic pixel labelling with convolutional networks and conditional random fields. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 36–43, 2015.
[21] F. Perazzi, J. Pont-Tuset, B. McWilliams, L. Van Gool, M. Gross, and A. Sorkine-Hornung. A benchmark dataset and evaluation methodology for video object segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 724–732, 2016.
[22] S. Wang, M. Bai, G. Mattyus, H. Chu, W. Luo, B. Yang, J. Liang, J. Cheverie, S. Fidler, and R. Urtasun. TorontoCity: Seeing the world with a million eyes. arXiv preprint arXiv:1612.00423, 2016.
[23] C. Xie, J. Wang, Z. Zhang, Z. Ren, and A. Yuille. Mitigating adversarial effects through randomization. arXiv preprint arXiv:1711.01991, 2017.
[24] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2881–2890, 2017.