The Active Contour Model (ACM) [12] is one of the most influential computer vision techniques. It has been successfully employed in various image analysis tasks, including object segmentation and tracking. In most ACM variants, the deformable curve(s) of interest evolve dynamically through an iterative procedure that minimizes a corresponding energy functional. Since the ACM is a model-based formulation founded on geometric and physical principles, the segmentation process relies mainly on the content of the image itself, not on large annotated image datasets, extensive computational resources, and hours or days of training. However, the classic ACM relies on some degree of user interaction to specify the initial contour and tune the parameters of the energy functional, which undermines its applicability to the automated analysis of large quantities of images.
In recent years, Deep Neural Networks (DNNs) have become popular in many areas. In computer vision and medical image analysis, Convolutional Neural Networks (CNNs) have been successfully exploited for different segmentation tasks [6, 9, 17]. Despite their tremendous success, the performance of CNNs is still very dependent on their training datasets. In essence, CNNs rely on a filter-based learning scheme in which the weights of the network are usually tuned using a back-propagation error gradient descent approach. Since CNN architectures often include millions of trainable parameters, the training process relies on the sheer size of the dataset. In addition, CNNs usually generalize poorly to images that differ from those in the training datasets, and they are vulnerable to adversarial examples [23]. For image segmentation, capturing the details of object boundaries and delineating them remains a challenging task even for the most promising CNN architectures that have achieved state-of-the-art performance on relevant benchmark datasets [4, 10, 24]. The recently proposed Deeplabv3+ [5] has mitigated this problem to some extent by leveraging the power of dilated convolutions, but such improvements were made possible by extensive pre-training and vast computational resources; 50 GPUs were reportedly used to train this model.
In this paper, we aim to bridge the gap between CNNs and ACMs by introducing a truly end-to-end framework. Our framework leverages an automatically differentiable ACM with trainable parameters that allows for back-propagation of gradients. This ACM can be trained along with a backbone CNN from scratch and without any pre-training. Moreover, our ACM utilizes a locally-penalized energy functional that is directly predicted by its backbone CNN, in the form of 2D feature maps, and it is initialized directly by the CNN. Thus, our work alleviates one of the biggest obstacles to exploiting the power of ACMs by eliminating the need for any type of user supervision or intervention.
As a challenging test case for our DCAC framework, we tackle the problem of building instance segmentation in aerial images. Our DCAC sets new state-of-the-art benchmarks on the Vaihingen and Bing Huts datasets for building instance segmentation, outperforming its closest competitor by a wide margin.
2 Related Work
Eulerian active contours:
Eulerian active contours evolve the segmentation curve by dynamically propagating an implicit function so as to minimize its associated energy functional. The most notable approaches that utilize this formulation are the active contours without edges by Chan and Vese [3] and the geodesic active contours by Caselles et al. [2]. The Caselles-Kimmel-Sapiro model is mainly dependent on the location of the level-set, whereas the Chan-Vese model mainly relies on the content difference between the interior and exterior of the level-set. In addition, the work by Lankton and Tannenbaum [14] proposes a reformulation of the Chan-Vese model in which the energy functional incorporates image properties in local regions around the level-set, and it was shown to more accurately segment objects with heterogeneous features.
“End-to-End” CNNs with ACMs:
Several efforts have attempted to integrate CNNs with ACMs in an end-to-end manner, as opposed to utilizing the ACM merely as a post-processor of the CNN output. Le et al. [15] implemented level-set ACMs as Recurrent Neural Networks (RNNs) for the task of semantic segmentation of natural images. There are three key differences between our proposed DCAC and this effort: (1) DCAC does not reformulate ACMs as RNNs and as a result is more computationally efficient. (2) DCAC benefits from a novel locally-penalized energy functional, whereas [15] has constant weighted parameters. (3) DCAC has an entirely different pipeline: we employ a single CNN that is trained from scratch along with the ACM, whereas [15] requires two pre-trained CNN backbones (one for object localization, the other for classification). The dependence of [15] on pre-trained CNNs has limited its applicability. The other attempt, the DSAC model by Marcos et al. [16], is an integration of ACMs with CNNs in a structured prediction framework for building instance segmentation in aerial images. There are three key differences between DCAC and this work: (1) [16] heavily depends on the manual initialization of contours, whereas our DCAC is fully automated and runs without any external supervision. (2) The ACM used in [16] has a parametric formulation that can handle only a single building at a time, whereas our DCAC leverages the Eulerian ACM, which can naturally handle multiple building instances simultaneously. (3) [16] requires the user to explicitly calculate the gradients, whereas our approach fully automates the direct back-propagation of gradients through the entire DCAC framework due to its automatically differentiable ACM.
Building instance segmentation:
Modern CNN-based methods have been used with different approaches to the problem of building segmentation. Some efforts have treated this problem as a semantic segmentation problem [20, 22] and utilized post-processing steps to extract the building boundaries. Other efforts have utilized instance segmentation networks to directly predict the location of buildings.
3 Level Set Active Contours
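First proposed by Osher and Sethian [19] to evolve wavefronts in CFD simulations, a level-set is an implicit representation of a hypersurface that is dynamically evolved according to the nonlinear Hamilton-Jacobi equation. In 2D, let $C(t) = \{(x,y) \mid \phi(x,y,t) = 0\}$ be a closed time-varying contour represented in $\Omega \subset \mathbb{R}^2$ by the zero level set of the signed distance map $\phi(x,y,t)$. Function $\phi(x,y,t)$ evolves according to
$$\begin{cases} \dfrac{\partial \phi}{\partial t} = |\nabla \phi| \, \mathrm{div}\!\left( \dfrac{\nabla \phi}{|\nabla \phi|} \right), \\ \phi(x,y,0) = \phi_0(x,y), \end{cases} \tag{1}$$
where $\phi_0(x,y)$ represents the initial level set.
We introduce a generalization of the level-set ACM proposed by Chan and Vese [3]. Their model assumes that the image of interest $I(x,y)$ consists of two areas of distinct intensities. The interior of $C$ is represented by the smoothed Heaviside function
$$H_\epsilon(\phi) = \frac{1}{2} + \frac{1}{\pi} \arctan\!\left( \frac{\phi}{\epsilon} \right), \tag{2}$$
and $1 - H_\epsilon(\phi)$ represents its exterior. The derivative of (2) is the smoothed Dirac delta function
$$\delta_\epsilon(\phi) = \frac{\partial H_\epsilon(\phi)}{\partial \phi} = \frac{1}{\pi} \frac{\epsilon}{\epsilon^2 + \phi^2}. \tag{3}$$
The energy functional associated with $C$ is written as
$$\begin{aligned} E(\phi) = {} & \int_\Omega \mu \, \delta_\epsilon(\phi) \, |\nabla \phi| + \nu \, H_\epsilon(\phi) \; dx \, dy \\ & + \int_\Omega \lambda_1(x,y) \, \big( I(x,y) - m_1 \big)^2 H_\epsilon(\phi) \; dx \, dy \\ & + \int_\Omega \lambda_2(x,y) \, \big( I(x,y) - m_2 \big)^2 \big( 1 - H_\epsilon(\phi) \big) \; dx \, dy, \end{aligned} \tag{4}$$
where $\mu$ penalizes the length of $C$ and $\nu$ penalizes its enclosed area (both are set to fixed constants), and where $m_1$ and $m_2$ are the mean image intensities inside and outside $C$. We follow Lankton et al. [14] and define $m_1$ and $m_2$ as the mean image intensities inside and outside $C$ within a local window around $C$.
Note that to afford greater control over $C$, we have generalized the constants $\lambda_1$ and $\lambda_2$ used in [3] to parameter functions $\lambda_1(x,y)$ and $\lambda_2(x,y)$ in (4). The contour expands or shrinks at a certain location if $\lambda_2(x,y) > \lambda_1(x,y)$ or $\lambda_2(x,y) < \lambda_1(x,y)$, respectively. In DCAC, these parameter functions are trainable and learned directly by the backbone CNN. Fig. 2 illustrates an example of these maps as learned by the CNN.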
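The terms of the energy functional (4) can be made concrete with a short NumPy sketch. It assumes the standard arctangent form of the smoothed Heaviside function used in the Chan-Vese model, replaces the integrals by sums over pixels, and takes $m_1$ and $m_2$ as given inputs. The function names and the default values of `mu`, `nu`, and `eps` are illustrative placeholders, not values taken from the paper.

```python
import numpy as np

def heaviside(phi, eps=1.0):
    """Smoothed Heaviside (assumed arctan form): 1/2 + arctan(phi/eps)/pi."""
    return 0.5 + np.arctan(phi / eps) / np.pi

def dirac(phi, eps=1.0):
    """Derivative of the smoothed Heaviside with respect to phi."""
    return eps / (np.pi * (eps ** 2 + phi ** 2))

def energy(phi, image, lam1, lam2, m1, m2, mu=1.0, nu=0.0, eps=1.0):
    """Discrete version of the energy: length/area term plus the two
    lambda-weighted region terms. lam1 and lam2 are per-pixel maps."""
    gy, gx = np.gradient(phi)               # derivatives along rows, columns
    grad_norm = np.sqrt(gx ** 2 + gy ** 2)
    h = heaviside(phi, eps)
    length_area = mu * dirac(phi, eps) * grad_norm + nu * h
    inside = lam1 * (image - m1) ** 2 * h            # interior mismatch
    outside = lam2 * (image - m2) ** 2 * (1.0 - h)   # exterior mismatch
    return float(np.sum(length_area + inside + outside))
```

With a level set whose positive region coincides with a bright object, the energy should be lower than with the sign of the level set flipped, which is the behaviour the minimization exploits.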
4 CNN Backbone
Each convolutional layer is followed by a Rectified Linear Unit (ReLU) as the activation layer and a batch normalization. The dilated residual block consists of 2 consecutive dilated convolutional layers whose outputs are fused with its input and fed into the ReLU activation layer. In the encoder, each path consists of 2 consecutive convolutional layers, followed by a dilated residual unit with a dilation rate of 2. Before being fed into the dilated residual unit, the output of these convolutional layers is added to the output feature maps of another 2 consecutive convolutional layers that learn additional multi-scale information from the resized input image in that resolution. To recover the content lost in the learned feature maps during the encoding process, we utilize a series of consecutive dilated residual blocks with dilation rates of 1, 2, and 4 and feed the output to a dilated spatial pyramid pooling layer with 4 different dilation rates of 1, 6, 12, and 18. The decoder is connected to the dilated residual units at each resolution via skip connections, and in each path we up-sample the image and employ 2 consecutive convolutional layers before proceeding to the next resolution. The output of the decoder is fed into another series of 2 consecutive convolutional layers and then passed into 3 separate convolutional layers for predicting the output maps of $\lambda_1(x,y)$ and $\lambda_2(x,y)$ as well as the distance transform.
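To give a sense of what the dilation rates above imply, the following sketch computes the effective extent of a dilated kernel and the receptive field of a stack of stride-1 dilated convolutions. The 3×3 kernel size is an assumption made for illustration (the kernel size is not stated in this section), and the helper names are hypothetical.

```python
def effective_kernel_size(kernel_size, dilation):
    """Spatial extent covered by one dilated kernel along one axis."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)

def same_padding(kernel_size, dilation):
    """Padding per side that preserves spatial size (stride 1, odd kernel)."""
    return dilation * (kernel_size - 1) // 2

def stacked_receptive_field(kernel_size, dilations):
    """Receptive field of consecutive stride-1 dilated convolutions."""
    field = 1
    for d in dilations:
        field += (kernel_size - 1) * d
    return field

# Pyramid pooling rates from the text, under the assumed 3x3 kernels.
pyramid_extents = [effective_kernel_size(3, d) for d in (1, 6, 12, 18)]

# Three residual blocks (rates 1, 2, 4), two convolutions per block.
residual_field = stacked_receptive_field(3, [1, 1, 2, 2, 4, 4])
```

Under these assumptions the four pyramid branches span 3, 13, 25, and 37 pixels per axis, which is how a single layer can gather context at several scales without down-sampling.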
5 DCAC Architecture and Implementation
In our DCAC framework (Fig. 1), the CNN backbone serves to directly initialize the zero level-set contour as well as the weighted local parameters. We initialize the zero level-set by a learned distance transform that is directly predicted by the CNN along with additional convolutional layers that learn the parameter maps. Figure 2 illustrates an example of what the backbone CNN learns in the DCAC on one input image from the Vaihingen dataset. These learned parameters are then passed to the ACM, which unfolds for a certain number of timesteps in a differentiable manner. The final zero level-set is then converted to logits and compared with the label, and the resulting error is back-propagated through the entire framework in order to tune the weights of the CNN backbone. Algorithm 1 presents the details of the DCAC training algorithm.
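The unfolding of the ACM over a fixed number of timesteps can be sketched as follows. This is a simplified NumPy illustration, not the DCAC implementation: it assumes the standard gradient-descent update obtained from the energy functional (4), uses global rather than local mean intensities, and omits automatic differentiation. The function names, step size, and default constants are placeholders.

```python
import numpy as np

def curvature(phi, tiny=1e-8):
    """div(grad(phi) / |grad(phi)|) via central finite differences."""
    gy, gx = np.gradient(phi)
    norm = np.sqrt(gx ** 2 + gy ** 2) + tiny
    dny_dy, _ = np.gradient(gy / norm)
    _, dnx_dx = np.gradient(gx / norm)
    return dnx_dx + dny_dy

def evolve(phi0, image, lam1, lam2, steps=200, dt=0.5,
           mu=0.1, nu=0.0, eps=1.0, tiny=1e-8):
    """Unroll `steps` explicit updates of the level set phi."""
    phi = phi0.astype(float).copy()
    for _ in range(steps):
        h = 0.5 + np.arctan(phi / eps) / np.pi           # smoothed Heaviside
        delta = eps / (np.pi * (eps ** 2 + phi ** 2))    # smoothed Dirac
        m1 = (image * h).sum() / (h.sum() + tiny)        # mean inside
        m2 = (image * (1 - h)).sum() / ((1 - h).sum() + tiny)  # mean outside
        force = (mu * curvature(phi) - nu
                 - lam1 * (image - m1) ** 2
                 + lam2 * (image - m2) ** 2)
        phi = phi + dt * delta * force
    return phi

def demo():
    """Bright square, small circular initial contour, unit lambda maps."""
    yy, xx = np.mgrid[0:24, 0:24]
    image = np.zeros((24, 24))
    image[8:16, 8:16] = 1.0
    phi0 = 3.0 - np.sqrt((yy - 11.5) ** 2 + (xx - 11.5) ** 2)
    lam = np.ones_like(image)
    phi = evolve(phi0, image, lam, lam)
    return image > 0.5, phi0 > 0, phi > 0
```

In an automatic differentiation framework every operation in the loop is differentiable, so a loss computed on the final level set can propagate gradients back to the `lam1`, `lam2`, and `phi0` inputs, which is what makes it possible for the backbone to predict them.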
5.1 Implementation Details
All components of DCAC, including the ACM, have been implemented entirely in TensorFlow [1] and are compatible with both TensorFlow 1.x and 2.0. The ACM implementation benefits from the automatic differentiation utility of TensorFlow and has been designed to enable the back-propagation of the error gradient through the layers of the ACM.
In each ACM layer, each point along the zero level-set contour is probed by a local window, and the mean intensities of the inside and outside regions, i.e., $m_1$ and $m_2$ in (4), are extracted. In our implementation, $m_1$ and $m_2$ are extracted by using a differentiable global average pooling layer with appropriate padding so as not to lose any information on the edges.
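One way to picture the extraction of the local means is as a ratio of two box-filtered maps. The sketch below is a NumPy illustration under stated assumptions: the window is a square of side `2 * radius + 1`, the border is handled by edge replication (the text only says "appropriate padding"), and the helper names are hypothetical. It is not the TensorFlow pooling layer used in DCAC.

```python
import numpy as np

def box_mean(x, radius):
    """Mean over a (2r+1) x (2r+1) window, using edge-replicated padding
    so that the output has the same shape as the input."""
    k = 2 * radius + 1
    padded = np.pad(x, radius, mode="edge")
    # Integral image with a leading zero row and column.
    ii = np.pad(padded, ((1, 0), (1, 0))).cumsum(axis=0).cumsum(axis=1)
    window_sum = ii[k:, k:] - ii[:-k, k:] - ii[k:, :-k] + ii[:-k, :-k]
    return window_sum / (k * k)

def local_means(image, phi, radius=2, eps=1.0, tiny=1e-8):
    """Local mean intensity inside (m1) and outside (m2) the contour."""
    h = 0.5 + np.arctan(phi / eps) / np.pi     # soft interior indicator
    m1 = box_mean(image * h, radius) / (box_mean(h, radius) + tiny)
    m2 = box_mean(image * (1 - h), radius) / (box_mean(1 - h, radius) + tiny)
    return m1, m2
```

Because the padding replicates the border, windows centred on edge pixels are still averaged over a full window, so buildings touching the image boundary are not penalized by missing context.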
All training was performed on an Nvidia Titan XP GPU and an Intel® Core™ i7-7700K CPU @ 4.20GHz. The minibatch sizes for training on the Vaihingen and Bing Huts datasets were 3 and 20, respectively. All training sessions employ the Adam optimization algorithm [13] with a learning rate of 0.001 that decays by a factor of 10 every 10 epochs.
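The learning rate schedule described above amounts to a step decay. A minimal sketch, assuming epochs are counted from zero and the decay is applied at the start of every tenth epoch (the function name is hypothetical):

```python
def step_decay_lr(epoch, base_lr=1e-3, factor=10.0, every=10):
    """Learning rate for a zero-indexed epoch under step decay."""
    return base_lr / (factor ** (epoch // every))

# Learning rates seen during the first 30 epochs.
schedule = [step_decay_lr(e) for e in range(30)]
```

Under this reading, epochs 0 to 9 use 0.001, epochs 10 to 19 use 0.0001, and so on.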
| DCAC: Single Inst | 0.928 | 0.929 | 0.943 | 0.819 | 0.855 | 0.860 | 0.894 | 0.534 |
| DCAC: Multi Inst | 0.908 | 0.893 | 0.910 | 0.797 | 0.797 | 0.809 | 0.839 | 0.491 |
| DCAC: Single Inst, Const | 0.877 | 0.888 | 0.936 | 0.801 | 0.792 | 0.813 | 0.889 | 0.513 |
| DCAC: Multi Inst, Const | 0.857 | 0.842 | 0.876 | 0.707 | 0.757 | 0.777 | 0.891 | 0.486 |
The Vaihingen buildings dataset (http://www2.isprs.org/commissions/comm3/wg4/2d-sem-label-vaihingen.html) consists of 168 building images. The labels for each image are generated by using a semi-automated approach. We used 100 images for training and 68 for testing, following the same data split as in [16]. In this dataset, almost all images consist of multiple instances of buildings, some of which are located at the edges of the image.
The Bing Huts dataset (https://www.openstreetmap.org/#map=4/38.00/-95.80) consists of 605 images. We followed the same data split that is used in [16] and used 335 images for training and 270 images for testing. This dataset is especially challenging due to the low spatial resolution and contrast of the images.
6.2 Evaluation Metrics and Loss Function
To evaluate our model's performance, we utilized five different metrics: Dice, mean Intersection over Union (mIoU), Weighted Coverage (WCov), Boundary F (BoundF), and Root Mean Square Error (RMSE). The original DSAC paper reported only mIoU for both Vaihingen and Bing Huts, and RMSE only for the Bing Huts dataset. However, since the delineation of boundaries is one of the important goals of our framework, we employ the BoundF metric [21] to precisely measure the similarity between the specific boundary pixels in our predictions and the corresponding image labels. Furthermore, we used a soft Dice loss function in training our model.
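For reference, the overlap metrics and the training loss can be sketched in NumPy as below. The Dice and IoU definitions are the standard ones for binary masks; the soft Dice loss is shown in one common variant (sums of probabilities in the denominator), which is an assumption since the exact variant is not specified here. mIoU would be the mean of the per-image IoU values.

```python
import numpy as np

def dice_score(pred, target, tiny=1e-8):
    """Dice coefficient between two binary masks."""
    pred, target = pred.astype(bool), target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    return (2.0 * inter + tiny) / (pred.sum() + target.sum() + tiny)

def iou_score(pred, target, tiny=1e-8):
    """Intersection over Union between two binary masks."""
    pred, target = pred.astype(bool), target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    return (inter + tiny) / (union + tiny)

def soft_dice_loss(prob, target, tiny=1e-8):
    """1 - soft Dice, where prob holds per-pixel probabilities in [0, 1]."""
    inter = (prob * target).sum()
    return 1.0 - (2.0 * inter + tiny) / (prob.sum() + target.sum() + tiny)
```

Unlike the hard Dice score, the soft version operates on probabilities without thresholding, which keeps it differentiable and therefore usable as a training objective.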
7 Results and Discussion
7.1 Local and Fixed Weighted Parameters
To validate the contribution of the local weighted parameters in the level-set ACM, we also trained our DCAC on both the Vaihingen and Bing Huts datasets by only allowing one trainable scalar parameter for each of $\lambda_1$ and $\lambda_2$, which is constant over the entire image. As presented in Table 1, in both the Vaihingen and Bing Huts datasets, this constant-$\lambda$ formulation still outperforms the baseline CNN in all evaluation metrics for both single-instance and multi-instance buildings, thus showing the effectiveness of the end-to-end training of the DCAC. However, the DCAC with the full $\lambda_1(x,y)$ and $\lambda_2(x,y)$ maps outperforms this constant formulation by a wide margin in all experiments and metrics.
A key metric of interest in this comparison is the BoundF score, which demonstrates how our local formulation captures the details of the boundaries more effectively by adjusting the inward and outward forces on the contour locally. As illustrated in Figure 4, DCAC has perfectly delineated the boundaries of the building instances, whereas DCAC with the constant formulation has over-segmented these instances.
7.2 Buildings on the Edges of the Image
Our DCAC is capable of properly segmenting the instances of buildings located on the edges of some of the images present in the Vaihingen dataset. This is mainly due to the proper padding scheme that we have utilized in our global average pooling layer used to extract the local intensities of pixels while avoiding the loss of information on the boundaries.
7.3 Initialization and Number of ACM Iterations
In all cases, we performed our experiments with the goal of leveraging the CNN to fully automate the ACM and eliminate the need for any human supervision. Our scheme for learning a generalized distance transform directly helped us to localize all the building instances simultaneously and initialize the zero level-sets appropriately while avoiding a computationally expensive and non-differentiable distance transform operation. In addition, initializing the zero level-sets in this manner, instead of the common practice of initializing from a circle, helped the contour to converge significantly faster and avoid undesirable local minima.
7.4 Comparison Against the DSAC Model
Although most of the images in the Vaihingen dataset consist of multiple instances of buildings, the DSAC model [16] can deal with only a single building at a time. For a fair comparison between the two approaches, we report separate metrics for a single building, as reported in [16] for DSAC, as well as for all the instances of buildings (which DSAC cannot handle). As presented in Table 1, our DCAC outperforms DSAC in mIoU on both the Vaihingen and Bing Huts datasets. Furthermore, the multiple-instance metrics of our DCAC outperform the single-instance DSAC results. As demonstrated in Fig. 5, in the Vaihingen dataset, DSAC struggles to cope with the topological changes of the buildings and fails to appropriately capture sharp edges, while our framework handles these challenges in most cases. In the Bing Huts dataset, DSAC is able to localize the buildings, but in many cases it over-segments them. This may be due to DSAC's inability to distinguish the building from the surrounding soil because of the low contrast and small size of the image. By comparison, our DCAC handles this low-contrast dataset well and produces more accurate boundaries, as can be seen by comparing the segmentation output of DSAC (b) with that of our DCAC (c) in Fig. 5.
8 Conclusions and Future Work
We have introduced a novel image segmentation framework, called DCAC, which is a truly end-to-end integration of ACMs and CNNs. We proposed a novel locally-penalized Eulerian energy model that allows for pixel-wise learnable parameters that can adjust the contour to precisely capture and delineate the boundaries of objects of interest in the image. As a test case, we have tackled the problem of building instance segmentation on two very challenging datasets, Vaihingen and Bing Huts, on which our model outperforms the current state-of-the-art method, DSAC. Unlike DSAC, which relies on the manual initialization of its ACM contour, our model requires minimal human supervision and is initialized and guided by its CNN backbone. Moreover, DSAC can only segment a single building at a time, whereas our DCAC can segment multiple buildings simultaneously. We also showed that, unlike DSAC, our DCAC is effective in handling various topological changes in the image. Given the level of success that DCAC has achieved in this application and the fact that it features a general Eulerian formulation, it is readily applicable to other segmentation tasks in various domains where purely CNN filter-based approaches can benefit from the versatility and precision of ACMs in delineating object boundaries in images.
References
[1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. TensorFlow: A system for large-scale machine learning. In OSDI, volume 16, pages 265–283, 2016.
[2] V. Caselles, R. Kimmel, and G. Sapiro. Geodesic active contours. International Journal of Computer Vision, 22(1):61–79, 1997.
[3] T. F. Chan and L. A. Vese. Active contours without edges. IEEE Transactions on Image Processing, 10(2):266–277, 2001.
[4] L.-C. Chen, G. Papandreou, F. Schroff, and H. Adam. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587, 2017.
[5] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. arXiv preprint arXiv:1802.02611, 2018.
[6] A. Hatamizadeh, S. P. Ananth, X. Ding, D. Terzopoulos, N. Tajbakhsh, et al. Automatic segmentation of pulmonary lobes using a progressive dense V-network. In Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support, pages 282–290. Springer, 2018.
[7] A. Hatamizadeh, A. Hoogi, D. Sengupta, W. Lu, B. Wilcox, D. Rubin, and D. Terzopoulos. Deep active lesion segmentation. arXiv preprint arXiv:1908.06933, 2019.
[8] A. Hatamizadeh, H. Hosseini, Z. Liu, S. D. Schwartz, and D. Terzopoulos. Deep dilated convolutional nets for the automatic segmentation of retinal vessels. arXiv preprint arXiv:1905.12120, 2019.
[9] A. Hatamizadeh, D. Terzopoulos, and A. Myronenko. End-to-end boundary aware networks for medical image segmentation. arXiv preprint arXiv:1908.08071, 2019.
[10] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. In Computer Vision (ICCV), 2017 IEEE International Conference on, pages 2980–2988. IEEE, 2017.
[11] V. Iglovikov, S. Seferbekov, A. Buslaev, and A. Shvets. TernausNetV2: Fully convolutional network for instance segmentation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2018.
[12] M. Kass, A. Witkin, and D. Terzopoulos. Snakes: Active contour models. International Journal of Computer Vision, 1(4):321–331, 1988.
[13] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[14] S. Lankton and A. Tannenbaum. Localizing region-based active contours. IEEE Transactions on Image Processing, 17(11):2029–2039, 2008.
[15] T. H. N. Le, K. G. Quach, K. Luu, C. N. Duong, and M. Savvides. Reformulating level sets as deep recurrent neural network approach to semantic segmentation. IEEE Transactions on Image Processing, 27(5):2393–2407, 2018.
[16] D. Marcos, D. Tuia, B. Kellenberger, L. Zhang, M. Bai, R. Liao, and R. Urtasun. Learning deep structured active contours end-to-end. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 8877–8885, 2018.
[17] A. Myronenko and A. Hatamizadeh. 3D kidneys and kidney tumor semantic segmentation using boundary-aware networks. arXiv preprint arXiv:1909.06684, 2019.
[18] S. Osher and R. P. Fedkiw. Level set methods: An overview and some recent results. Journal of Computational Physics, 169(2):463–502, 2001.
[19] S. Osher and J. A. Sethian. Fronts propagating with curvature-dependent speed: Algorithms based on Hamilton-Jacobi formulations. Journal of Computational Physics, 79(1):12–49, 1988.
[20] S. Paisitkriangkrai, J. Sherrah, P. Janney, V.-D. Hengel, et al. Effective semantic pixel labelling with convolutional networks and conditional random fields. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 36–43, 2015.
[21] F. Perazzi, J. Pont-Tuset, B. McWilliams, L. Van Gool, M. Gross, and A. Sorkine-Hornung. A benchmark dataset and evaluation methodology for video object segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 724–732, 2016.
[22] S. Wang, M. Bai, G. Mattyus, H. Chu, W. Luo, B. Yang, J. Liang, J. Cheverie, S. Fidler, and R. Urtasun. TorontoCity: Seeing the world with a million eyes. arXiv preprint arXiv:1612.00423, 2016.
[23] C. Xie, J. Wang, Z. Zhang, Z. Ren, and A. Yuille. Mitigating adversarial effects through randomization. arXiv preprint arXiv:1711.01991, 2017.
[24] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia. Pyramid scene parsing network. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 2881–2890, 2017.