## 1 Introduction

Automatically extracting building footprints from aerial imagery is an important task in remote sensing, with applications in cartography, urban planning, and humanitarian aid. While maps of well-established urban areas provide precise building outlines, in more general situations this information may be out of date or unavailable altogether. Such is the case in remote population centers and in dynamic scenarios caused by rapid urban development or natural disasters. These situations motivate the use of automatic building segmentation to form an understanding of an area.

Convolutional neural networks (CNNs) have rapidly established themselves as the de facto standard for tasks of semantic and instance segmentation, as demonstrated by their impressive performance across a variety of datasets [25, 7, 13]. When applied to the task of building segmentation, however, there exists room for improvement. Specifically, as demonstrated in [22], while CNNs are able to generate dense, pixel-based predictions with high recall, they have issues with precise delineation of building boundaries.

Arguably, these boundaries are the most useful features in defining the shape and location of a building. Motivated by this, Marcos *et al*. proposed the deep structured active contours (DSAC) [15] model, which combines the power of deep CNNs with the classic polygon-based active contour model of Kass *et al*. [11]. This fits in particularly well with our prior understanding of buildings, which are generally laid out as relatively simple polygons. DSAC uses a deep CNN to predict an energy landscape on which the active contour, also known as a snake, crawls.
When the energy reaches the minimum, the snake stops and the enclosed region is the final predicted segmentation.

Although DSAC’s contours exhibit improved coverage compared to CNN-based segmentation, they can still be blob-like without strict adherence to building boundaries. Additionally, the representation of contour points with two degrees of freedom presents some challenges. Most notably, it adds computational overhead when minimizing the proposed energy, and it allows some contours to self-intersect. To address these limitations, we introduce the Deep Active Ray Network (DARNet), a framework based on a polar representation of active contours, traditionally known as active rays. This ray-based parameterization provides several advantages: 1) contour self-intersections are completely eliminated; 2) it allows us to propose a simpler energy function; 3) the parameterization lends itself to a new loss function that encourages contours to match building boundaries. Our DARNet also exploits a deep CNN as the backbone network for predicting the energy landscape. We train the whole model end-to-end by back-propagating through the energy minimization and the backbone CNN. We compare DARNet against DSAC on three datasets, Vaihingen, Bing Huts, and TorontoCity, and demonstrate its improved effectiveness in producing segmentations that better align with building boundaries.

## 2 Related Work

Active Contour Models: First introduced in [11] under the name snakes, active contour models proved to be an extremely popular approach to image segmentation. In these models, an energy is defined as a functional, and its minimization yields a contour that describes a segmentation. The description of this energy is based on intuitive geometric priors and leverages features such as the intensity image and its gradient. Of the myriad works that followed, [5] proposed a balloon force to counter the tendency of snakes to collapse when initialized far from object edges. [6] reformulated snakes in a polar representation known as active rays to reduce the energy optimization from two dimensions to one, and addressed the issue of contour collapse by introducing energies that encouraged circular shapes. Our approach leverages the parameterization of active rays, but we elect to use the curvature term proposed in snakes, as our application of interest does not typically contain circular boundaries. Furthermore, we propose a novel balloon energy that does not involve computing normals at every point in our contour, but rather exploits the properties of a polar parameterization to achieve the desired effect in a manner that can be efficiently computed and that fits in seamlessly with our contour inference.

Deep Active Contour Models: [18] proposed to combine deep learning with the active contours framework of [20] by having a CNN predict a vector field for a patch around a given contour point to guide it to the closest boundary. However, this method cannot be learned end-to-end, as CNN learning is separated from the active contour inference. [9, 23] propose to combine a CNN with a level set method [16] in an end-to-end differentiable manner. In contrast to level sets, which define contours implicitly, snakes provide an explicit representation of contour points, allowing for the definition of energies based on geometric intuition. [15] uses the snakes framework and replaces the engineered features with ones learned by a CNN. The problem is posed under the formulation of structured output prediction, and the CNN is trained using a structured support vector machine (SSVM) hinge loss [21] to optimize for intersection-over-union. In contrast, we propose to use an active rays parameterization alongside a largely simplified energy functional. We also propose a loss function that encourages sharper, better aligned contours. Additionally, we back-propagate our loss through the contour evolution, *i.e*., the energy minimization. It is interesting to note that there are other deep learning based approaches for predicting polygon-based contours. For example, rather than representing a polygon as a discretization of some continuous function, [4, 2] use a recurrent neural network (RNN) to directly predict the polygon vertices in a sequential manner. In [12], the authors predict the polygon or spline outlining the object using a Graph Convolutional Network.

Building Segmentation: Current state-of-the-art approaches to building segmentation typically incorporate CNNs in a two stage pipeline: identify instances, and extract polygons. As shown in [22], instances can be extracted from connected components of the semantic mask predicted by a semantic segmentation network [14, 8], or directly predicted with an instance segmentation network [3]. A polygon is then extracted from the instance mask. Because the output space of these approaches is individual pixels, the networks do not reason about the geometry of their predictions, resulting in segmentations that are blob-like around building boundaries. In contrast, the concept of a polygonal output is directly embedded in our model, where we can encourage our outputs to match ground truth shapes.

## 3 Our Approach

In this section, we introduce our DARNet model. Given an input image, our CNN predicts feature maps that define an energy landscape. We place a polygon-based contour, parameterized using polar coordinates, at an initial position on this landscape. This contour then evolves to minimize its energy via gradient descent, and its resting position defines the predicted instance segmentation. We refer the reader to Figure 1 for an illustration of our model. In the subsequent sections, we first describe our parametrization of the contour, highlighting its advantages, such as avoiding self-intersections. We then introduce our energy formulation, in particular our novel balloon energy, and the contour evolution process. Last, we explain how to learn our model in an end-to-end manner.

### 3.1 Contour Representation

Recall that in the active contour (or snake) model a contour point is represented as

$$c(s) = \big(x(s),\; y(s)\big), \tag{1}$$

where $s$ denotes the arc length and is generally defined over an interval $s \in [0, 1]$. Note that by varying the arc length $s$ from $0$ to $1$, we obtain the contour. Since this parameterization adopts separate functions for the $x$ and $y$ coordinates, the contour point is free to move in any direction, which may cause self-intersection as shown in Figure 3. In contrast, in this paper, we propose to use a parametrization that implicitly avoids self-intersection.

#### Active Rays:

Inspired by [6], we parameterize the contour via rays. In particular, we define a contour point as

$$c(\theta) = \big(x_c + \rho(\theta)\cos\theta,\; y_c + \rho(\theta)\sin\theta\big), \tag{2}$$

where $(x_c, y_c)$ define the reference point of the contour, $\rho \geq 0$ is the radius, and $\theta$ is the angle measured from the x-axis, ranging over $[0, 2\pi)$. We assume $(x_c, y_c)$ to be fixed within the interior of the object of interest. To ease the computation, we discretize the contour as a sequence of $L$ points $\{c_i\}_{i=1}^{L}$. The discretization is chosen such that points have equal angular spacing,

$$c_i = \big(x_c + \rho_i\cos(i\Delta\theta),\; y_c + \rho_i\sin(i\Delta\theta)\big), \tag{3}$$

where $\Delta\theta = 2\pi / L$ and $\rho_i = \rho(i\Delta\theta)$. The above ray-based parameterization is called active rays. Importantly, if the region enclosed by the contour forms a convex set, we can guarantee that for any interior reference point, given any angle $\theta$, there is only one corresponding radius $\rho$, based on the following proposition.

###### Proposition 1.

Given a closed convex set $\mathcal{X}$, a ray starting from any interior point of $\mathcal{X}$ will intersect the boundary of $\mathcal{X}$ exactly once.

We leave the proof to the supplementary material. If the region is non-convex, a ray may have more than one intersection point with the boundary. In that case, we pick the one with the minimum distance from the reference point, thus eliminating the possibility that multiple radii $\rho$ correspond to the same angle $\theta$. Therefore, compared to snakes, an active rays parameterization avoids self-intersections as shown in Figure 3. Moreover, since we fix the angles at which rays can emanate, the contour points possess an inherent order. Such ordering does not exist for snakes; for example, any cyclic permutation of the snake points produces an identical contour. As we see in Section 3.4, this allows us to naturally use a loss function that encourages our contours to match building boundaries.
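As a concrete illustration of the discretized parameterization, the following minimal Python sketch maps a list of radii to Cartesian contour points and computes the enclosed area. The function names `rays_to_points` and `polygon_area` are hypothetical, chosen for this example only, and the indexing convention (ray $i$ at angle $i \cdot 2\pi/L$, with $i$ starting at 1) is an assumption.

```python
import math

def rays_to_points(center, radii):
    """Convert radii along equally spaced rays into Cartesian contour points."""
    xc, yc = center
    n = len(radii)
    dtheta = 2.0 * math.pi / n  # equal angular spacing between consecutive rays
    return [
        (xc + r * math.cos(i * dtheta), yc + r * math.sin(i * dtheta))
        for i, r in enumerate(radii, start=1)  # assumed convention: ray i at angle i * dtheta
    ]

def polygon_area(points):
    """Shoelace formula; positive when the points are in counter-clockwise order."""
    total = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:] + points[:1]):
        total += x0 * y1 - x1 * y0
    return 0.5 * total

points = rays_to_points(center=(10.0, 20.0), radii=[1.0, 2.0, 3.0, 4.0])
area = polygon_area(points)
```

Because the angles are fixed and increasing, the points are always visited in the same angular order. With at least three rays and positive radii, the signed area is a sum of positive triangle areas and therefore stays positive.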

#### Multiple Sets of Active Rays:

Note that active rays largely preclude contours that enclose non-convex regions. While this is not the dominating case in our application domain, we would like to create a solution that can handle non-convex shapes. Towards this goal, we propose to use multiple sets of active rays, where each set has its own fixed reference point. First, we exploit an instance segmentation to generate a segment over the region of interest (RoI). We use the method in [3] as it tends to under segment the RoIs, thus largely guaranteeing our reference point to lie in the interior. Using this segment, we calculate a distance transform on the mask, and select the location of the largest value as the reference point of our initial contour. If this evolved contour cannot cover the whole segment, we then repeat the process using the distance transform of the uncovered regions, until we cover the whole segment. This is illustrated in Figure 4. The union of evolved contours is used to construct the final prediction.
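The reference point selection described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the distance transform is a brute-force Euclidean version suitable only for tiny masks, ties are broken by row-major order, and the helper names `distance_transform`, `pick_reference_point`, and `uncovered` are hypothetical.

```python
import math

def distance_transform(mask):
    """Euclidean distance from each foreground pixel to the nearest background pixel.

    Brute force, for small illustrative masks only. Assumes the mask contains
    at least one background (0) pixel.
    """
    h, w = len(mask), len(mask[0])
    background = [(r, c) for r in range(h) for c in range(w) if not mask[r][c]]
    dist = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            if mask[r][c]:
                dist[r][c] = min(math.hypot(r - br, c - bc) for br, bc in background)
    return dist

def pick_reference_point(mask):
    """Return (row, col) of the largest distance value (first one found on ties)."""
    dist = distance_transform(mask)
    best, best_rc = -1.0, None
    for r, row in enumerate(dist):
        for c, d in enumerate(row):
            if mask[r][c] and d > best:
                best, best_rc = d, (r, c)
    return best_rc

def uncovered(mask, covered):
    """Segment pixels not yet covered by any evolved contour."""
    return [[int(m and not k) for m, k in zip(mrow, krow)]
            for mrow, krow in zip(mask, covered)]

segment = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0, 0],
    [0, 1, 1, 1, 1, 1, 0],
    [0, 1, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
]
first_ref = pick_reference_point(segment)

# Pretend the first evolved contour covered only the 3x3 block of the segment.
covered = [[int(1 <= r <= 3 and 1 <= c <= 3) for c in range(7)] for r in range(5)]
second_ref = pick_reference_point(uncovered(segment, covered))
```

In this toy mask the first reference point falls at the center of the 3x3 block, and the second lands on the thin arm that the first contour is assumed to have missed.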

### 3.2 Contour Energy

We now introduce the formulation of the contour energy functional, of which the minimization yields the final contour. In particular, we encourage the contour to follow boundaries, prefer low-curvature solutions, and expand outwards from a small initialization. With the aforementioned discretization, the overall energy is defined as follows,

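$$E(c) = \sum_{i=1}^{L} \big[ E_{\text{data}}(c_i) + E_{\text{curv}}(c_i) + E_{\text{balloon}}(c_i) \big]. \tag{4}$$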

Unlike traditional active contour models, we parameterize the energy with the help of a backbone CNN. The hope is that the representation learning power of a CNN can make the contour evolution more accurate and efficient. We use a CNN architecture based on Dilated Residual Networks [24]. Specifically, we use DRN-D-22, with weights pretrained on ImageNet [19], and append additional learned upsampling layers using transposed convolutions. At the last layer, we predict three outputs that match the input image size, corresponding to three maps, which we denote as $(D, \beta, \kappa)$. We now describe each energy term in detail.

#### Data Term:

Given an input image, the data term is defined as

$$E_{\text{data}}(c_i) = D(c_i), \tag{5}$$

where $D$ is a non-negative feature map output by the backbone CNN. Note that $D$ and all subsequently described feature maps are of the same shape as the input image. Since we are minimizing the energy, the contour should seek out places where $D$ is low. Therefore, the CNN should ideally predict low values at the boundaries.

#### Curvature Term:

Intuitively, this term models the resistance of the contour towards bending as follows

$$E_{\text{curv}}(c_i) = \beta(c_i)\, \lvert c_{i+1} - 2c_i + c_{i-1} \rvert^2, \tag{6}$$

where $\beta$ is a non-negative feature map output by the backbone CNN and the squared term is a discrete approximation of the second order derivative of the contour. This term is flexible since the weighting scheme induced by $\beta$ can make the energy locally adaptive. This energy term will force the contour to straighten out wherever the value of $\beta$ is high. Our curvature term is simpler than the one in DSAC [15], where, in order to prevent snake points from clustering too close together in low-energy areas, an additional membrane term based on the first order derivative of the contour is employed. In contrast, we do not need such a term as our evenly spaced angles guarantee that contour points will not group too closely.

#### Balloon Term:

Our balloon term is defined by

$$E_{\text{balloon}}(c_i) = \kappa(c_i) \left( 1 - \frac{\rho_i}{\rho_{\max}} \right), \tag{7}$$

where $\kappa$ is a non-negative feature map output by the backbone CNN and $\rho_{\max}$ is the maximum radius a contour can reach without crossing the image boundary. The balloon term is designed to propel a given contour point outwards from its reference point, conditioned on the value of the underlying $\kappa$ map. It is necessary for two reasons. First, the data term may be insufficient to guide the contour towards the boundaries, especially if the contour was initialized far away from them. Second, as noted in [6], the curvature term has an auxiliary effect of shrinking the contour, which can lead to its collapse.

It is interesting to note that DSAC [15] also employs a balloon term which can be expressed using our notation as below,

(8) |

where denotes the area enclosed by the contour . This term pushes the contour to encapsulate as much of the map as possible. In our case, due to the active ray parameterization, a contour point can only move along one axis, either towards the reference point or away. Therefore, our balloon term is much simpler as it avoids the need to perform an area integral. Also, as we will see in the following section, our balloon term fits in seamlessly with our inference procedure.
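To make the three energy terms concrete, the sketch below evaluates the total energy of a discretized contour. It is an illustration under assumptions rather than the authors' code: the maps are passed as plain Python callables of $(x, y)$ instead of CNN outputs, the balloon term is taken as $\kappa(c_i)\,(1 - \rho_i / \rho_{\max})$, and the name `contour_energy` is hypothetical.

```python
import math

def contour_energy(center, radii, data_map, beta_map, kappa_map, rho_max):
    """Sum of data, curvature and balloon terms over all contour points."""
    xc, yc = center
    n = len(radii)
    dtheta = 2.0 * math.pi / n
    pts = [(xc + r * math.cos(i * dtheta), yc + r * math.sin(i * dtheta))
           for i, r in enumerate(radii, start=1)]
    total = 0.0
    for i, (x, y) in enumerate(pts):
        xp, yp = pts[i - 1]          # previous point (cyclic)
        xn, yn = pts[(i + 1) % n]    # next point (cyclic)
        # Squared norm of the discrete second derivative of the contour.
        second_diff_sq = (xn - 2 * x + xp) ** 2 + (yn - 2 * y + yp) ** 2
        e_data = data_map(x, y)
        e_curv = beta_map(x, y) * second_diff_sq
        e_balloon = kappa_map(x, y) * (1.0 - radii[i] / rho_max)  # assumed balloon form
        total += e_data + e_curv + e_balloon
    return total

# Four rays of unit length with constant maps: D = 0, beta = 1, kappa = 1.
energy = contour_energy(
    center=(0.0, 0.0),
    radii=[1.0, 1.0, 1.0, 1.0],
    data_map=lambda x, y: 0.0,
    beta_map=lambda x, y: 1.0,
    kappa_map=lambda x, y: 1.0,
    rho_max=2.0,
)
```

For this unit diamond each second difference has squared norm 4, so the curvature part is 16, and each ray contributes 0.5 to the balloon part, giving a total of 18. Note that no area integral or contour normal is needed at any point.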

### 3.3 Contour Evolution

Conditioned on the energy terms predicted by the backbone CNN, the second inference step is achieved through energy minimization. To evolve the initial contour towards the optimal one, we first derive the partial derivatives and set them to zero. We then resort to an iterative algorithm to solve the system of partial derivatives.

Specifically, the partial derivatives of the energy terms w.r.t. the contour are derived as below. For the data term,

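$$\frac{\partial E_{\text{data}}(c_i)}{\partial \rho_i} = \frac{\partial D(c_i)}{\partial x} \cos(i\Delta\theta) + \frac{\partial D(c_i)}{\partial y} \sin(i\Delta\theta), \tag{9}$$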

where we change the coordinates back to Cartesian to facilitate the computation of derivatives, *e.g*. with a Sobel filter.
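The chain rule behind the data-term derivative can be checked numerically. In the sketch below, the data map is an analytic function with a valley on a circle of radius 5 (standing in for a building boundary), and central differences replace the Sobel filter that would be used on a pixel grid. Both choices, and the name `data_gradient_along_ray`, are assumptions made for illustration.

```python
import math

def data_gradient_along_ray(data_map, center, rho, theta, h=1e-4):
    """Derivative of D along a ray: dD/dx * cos(theta) + dD/dy * sin(theta)."""
    xc, yc = center
    x = xc + rho * math.cos(theta)
    y = yc + rho * math.sin(theta)
    # Central differences in Cartesian coordinates (stand-in for a Sobel filter).
    dD_dx = (data_map(x + h, y) - data_map(x - h, y)) / (2.0 * h)
    dD_dy = (data_map(x, y + h) - data_map(x, y - h)) / (2.0 * h)
    return dD_dx * math.cos(theta) + dD_dy * math.sin(theta)

def valley_map(x, y):
    """Toy data map that is lowest on a circle of radius 5 around the origin."""
    return (math.hypot(x, y) - 5.0) ** 2

grad_inside = data_gradient_along_ray(valley_map, (0.0, 0.0), rho=3.0, theta=0.7)
grad_on_boundary = data_gradient_along_ray(valley_map, (0.0, 0.0), rho=5.0, theta=0.7)
```

Inside the valley the derivative is negative, so descending the energy increases the radius, and it vanishes once the contour point sits on the boundary.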

For the curvature term, substituting the Cartesian expression of contour points (Equation 3) into the expressions for the energy (Equation 6 and Equation 4), we have,

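$$\begin{aligned}
\frac{\partial E_{\text{curv}}(c)}{\partial \rho_i} \approx{} & \big[ 2\beta(c_{i+1}) \cos(2\Delta\theta) \big] \rho_{i+2} + \big[ -4 \big( \beta(c_{i+1}) + \beta(c_i) \big) \cos(\Delta\theta) \big] \rho_{i+1} \\
& + 2 \big[ \beta(c_{i+1}) + 4\beta(c_i) + \beta(c_{i-1}) \big] \rho_i \\
& + \big[ -4 \big( \beta(c_i) + \beta(c_{i-1}) \big) \cos(\Delta\theta) \big] \rho_{i-1} + \big[ 2\beta(c_{i-1}) \cos(2\Delta\theta) \big] \rho_{i-2},
\end{aligned} \tag{10}$$

where $E_{\text{curv}}(c) = \sum_i E_{\text{curv}}(c_i)$ and we discard the term arising from the product rule of differentiation as in [15]. We interpret this approximation as treating the $\beta$ map as not varying within the small vicinity of the contour points. Alternatively, we do not wish for the gradient of the $\beta$ map to exert pressure on the contour. Empirically, we found that doing so stabilizes learning, as the network only needs to adjust values of the $\beta$ map without attention to its gradients.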

For the balloon term, we use the same approach as the curvature term to obtain the partial derivative,

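$$\frac{\partial E_{\text{balloon}}(c_i)}{\partial \rho_i} \approx -\frac{\kappa(c_i)}{\rho_{\max}}. \tag{11}$$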

With the above derivation, we have a collection of partial differential equations w.r.t. individual contour points. We can summarize this system of equations in a compact matrix form,

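$$A\rho + f = 0, \tag{12}$$

where $\rho$ is a column vector of size $L$, $A$ is an $L \times L$ cyclic pentadiagonal matrix comprised of $E_{\text{curv}}$ derivatives, and $f$ is a column vector comprised of $E_{\text{data}}$ and $E_{\text{balloon}}$ derivatives.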
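The matrix form can be made concrete with a small sketch. The coefficients below are what one obtains by differentiating the curvature energy with the curvature weights treated as locally constant, and the fixed-step loop is one simple way to realize the iterative solution mentioned at the start of this section. Everything here is an assumption for illustration: the maps are constant per-ray samples (in the actual model they would be re-sampled at the contour points after every step), the step size 0.05 is chosen for this toy problem only, and `build_system` and `explicit_step` are hypothetical names.

```python
import math

def build_system(beta, kappa, data_grad, rho_max):
    """Assemble the cyclic pentadiagonal A and the vector f from per-ray samples.

    beta[i], kappa[i]: map values at contour point i.
    data_grad[i]: derivative of the data term with respect to radius i.
    """
    n = len(beta)
    dtheta = 2.0 * math.pi / n
    c1, c2 = math.cos(dtheta), math.cos(2.0 * dtheta)
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        bp, b0, bn = beta[i - 1], beta[i], beta[(i + 1) % n]
        A[i][(i - 2) % n] += 2.0 * bp * c2
        A[i][(i - 1) % n] += -4.0 * (bp + b0) * c1
        A[i][i] += 2.0 * (bp + 4.0 * b0 + bn)
        A[i][(i + 1) % n] += -4.0 * (b0 + bn) * c1
        A[i][(i + 2) % n] += 2.0 * bn * c2
    # Data derivative plus balloon derivative (-kappa / rho_max).
    f = [g - k / rho_max for g, k in zip(data_grad, kappa)]
    return A, f

def explicit_step(A, f, rho, dt):
    """One fixed-step update: rho <- rho - dt * (A rho + f)."""
    n = len(rho)
    residual = [sum(A[i][j] * rho[j] for j in range(n)) + f[i] for i in range(n)]
    return [r - dt * res for r, res in zip(rho, residual)]

n, rho_max = 8, 10.0
beta = [1.0] * n
kappa = [20.0] * n
A, f = build_system(beta, kappa, data_grad=[0.0] * n, rho_max=rho_max)

rho = [1.0, 3.0, 2.0, 4.0, 1.5, 2.5, 3.5, 0.5]  # irregular starting contour
for _ in range(2000):
    rho = explicit_step(A, f, rho, dt=0.05)

# Closed-form equilibrium radius for constant maps and a flat data term.
r_star = (kappa[0] / rho_max) / (8.0 * beta[0] * (1.0 - math.cos(2.0 * math.pi / n)) ** 2)
```

With constant maps the iteration settles on a regular polygon of radius `r_star`, the point at which the shrinking effect of the curvature term exactly balances the outward push of the balloon term. No matrix inverse is required at any step.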

This system of partial differential equations can be solved with an iterative method. The approach taken by [11] and [15] is an implicit-explicit method. For the purposes of our implementation, we adopt an explicit method instead, as it avoids the matrix inverse operation. Specifically, the contour evolves according to

$$\rho^{(t+1)} = \rho^{(t)} - \Delta t \, \big( A\rho^{(t)} + f \big), \tag{13}$$

where $\Delta t$ is a time step hyperparameter. In practice, we found setting $\Delta t$ as is stable enough for solving the system.

### 3.4 Learning

Since there exists an explicit ordering of the contour points, we can naturally generate a ground truth active ray and use it to supervise the prediction. Using the same reference point $(x_c, y_c)$ and angle discretization $\Delta\theta$, we cast rays outwards and record the distances at which they intersect the ground truth polygon. In the case of multiple intersections, we take the smallest distance, to prioritize hitting the correct boundaries over increasing coverage. This is illustrated in Figure 5. We use this collection of ground truth distances, $\{\hat{\rho}_i\}_{i=1}^{L}$, to compute an $L_1$ loss:

$$\mathcal{L}(\rho, \hat{\rho}) = \sum_{i=1}^{L} \lvert \rho_i - \hat{\rho}_i \rvert. \tag{14}$$

It differs from the loss employed by DSAC, which used an SSVM hinge loss to optimize for intersection-over-union. Instead, our loss encourages contour points to target building boundaries, which is simpler and more efficient.
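The ground truth ray casting and the resulting loss can be sketched as below. This is a minimal illustration, assuming a polygon given as a list of vertices, a sum of absolute radius differences as the loss, and the hypothetical helper names `cast_ground_truth_rays` and `l1_ray_loss`.

```python
import math

def cast_ground_truth_rays(center, polygon, num_rays, eps=1e-12):
    """Smallest positive distance from center to the polygon boundary along each ray."""
    ox, oy = center
    dtheta = 2.0 * math.pi / num_rays
    edges = list(zip(polygon, polygon[1:] + polygon[:1]))
    distances = []
    for i in range(1, num_rays + 1):
        dx, dy = math.cos(i * dtheta), math.sin(i * dtheta)
        best = math.inf
        for (px, py), (qx, qy) in edges:
            ex, ey = qx - px, qy - py
            wx, wy = px - ox, py - oy
            denom = dx * ey - dy * ex
            if abs(denom) < eps:  # ray is parallel to this edge
                continue
            t = (wx * ey - wy * ex) / denom  # distance along the ray
            u = (wx * dy - wy * dx) / denom  # position along the edge, in [0, 1]
            if t > 0.0 and -1e-9 <= u <= 1.0 + 1e-9:
                best = min(best, t)  # keep the closest intersection
        distances.append(best)
    return distances

def l1_ray_loss(predicted, target):
    """Sum of absolute differences between predicted and ground truth radii."""
    return sum(abs(p - g) for p, g in zip(predicted, target))

square = [(-1.0, -1.0), (1.0, -1.0), (1.0, 1.0), (-1.0, 1.0)]
gt_radii = cast_ground_truth_rays(center=(0.0, 0.0), polygon=square, num_rays=4)
loss = l1_ray_loss([1.5, 1.0, 0.5, 1.0], gt_radii)
```

Because predicted and ground truth radii share the same angles, the loss compares them index by index with no matching step. Keeping the closest intersection mirrors the rule stated above for non-convex ground truth shapes.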

To allow for gradients to backpropagate to the $D$, $\beta$, and $\kappa$ maps, we interpret the value of a given contour point as a floating point number, and compute the value of a map at that point (*e.g*. $D(c_i)$) using bilinear interpolation from its four adjacent map entries, in a manner similar to what is used in Spatial Transformer Networks [10]. We summarize the learning process in Algorithm 1.

## 4 Experiments

### 4.1 Experimental Setup

#### Datasets:

We evaluate DARNet on several building instance segmentation datasets: Vaihingen [1], Bing Huts [15], and TorontoCity [22]. The Vaihingen dataset consists primarily of detached buildings in a town in southern Germany. The original images are at a resolution of cm/pixel. There are buildings in total, which are divided into examples for train/test according to the same split used in [15]. The Bing Huts dataset consists of huts located in a rural area of Tanzania, with an original size of at a resolution of cm/pixel. There are images in total, divided into examples for train/test, again using the same splits. For these two datasets, we further partition the existing training data in an split to form a validation set, and use this for model selection. We do this for 5 disjoint validation sets, while the test set remains the same. The TorontoCity dataset contains aerial images captured over a dense urban area in Toronto. The images used have a resolution of cm/pixel. The dataset consists of approximately images for train/test which covers a diverse mixture of buildings, including residential, commercial, and industrial. We divide the training set of TorontoCity into a split for training and validation respectively.

#### Metrics:

We measure performance using intersection-over-union (IoU) averaged over the number of instances, weighted coverage, and polygon similarity as in [22]. Additionally, we evaluate the boundary F-score (BoundF) introduced in [17], averaged over thresholds from to pixels, inclusive. For Vaihingen and Bing Huts, we aggregate these metrics over all models selected with the various validation sets, measuring their mean and standard deviation. Lastly, for TorontoCity, we also evaluate the quality of alignment for the predicted boundaries. Specifically, we gather predicted contour pixels that match with the ground truth boundaries, within a threshold of 5 pixels. For these matches, we evaluate the alignment error with respect to the ground truth, which is determined as the cosine similarity between the ground truth boundary and the predicted boundary. We then rank these pixels by their alignment error, and plot the recall at various thresholds of this error.
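As a small illustration of the first metric, the sketch below computes IoU per instance and averages it over instances. Masks are represented as sets of pixel coordinates, an empty union is scored as 0, and the names `iou` and `mean_iou` are hypothetical. These are all assumptions for the example and do not describe the evaluation code used for the reported numbers.

```python
def iou(pred, truth):
    """Intersection-over-union of two pixel sets."""
    union = len(pred | truth)
    if union == 0:
        return 0.0  # assumed convention for two empty masks
    return len(pred & truth) / union

def mean_iou(pairs):
    """Average IoU over (prediction, ground truth) instance pairs."""
    return sum(iou(p, t) for p, t in pairs) / len(pairs)

pred_a = {(0, 0), (0, 1), (1, 0), (1, 1)}
truth_a = {(0, 1), (1, 1), (0, 2), (1, 2)}
pred_b = {(5, 5), (5, 6)}
truth_b = {(5, 5), (5, 6)}

score_a = iou(pred_a, truth_a)
score = mean_iou([(pred_a, truth_a), (pred_b, truth_b)])
```

Averaging per instance gives small and large buildings equal weight, in contrast to weighted coverage, which accounts for building area.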

#### Hyper-parameters:

We discretize our contour with points. For training, we use SGD with momentum, with learning rate , momentum , and a batch size of . The learning rate decay schedule is explained in supplementary material. We perform -step inference as it is found to be sufficient for convergence in practice. We initialize the contour randomly within the ground truth boundary during training. For testing, we initialize in image centers for Vaihingen and Bing Huts due to the fact that most buildings of these two datasets are of regular shape. As for TorontoCity, we leverage the deep watershed transform (DWT) [3] to get instance proposals and initialize as described in Section 3.1. Standard data augmentation techniques (random rotation, flipping, scaling, color jitter) are used. We do not adopt common stabilization tricks during inference as they are non-differentiable. Instead, we found using the Euclidean distance transform to pre-train our , and maps helps stabilize the inference during training. We leave more details of pre-training in the supplementary material.

### 4.2 Results

Method | mIoU | WCov | PolySim | BoundF |
---|---|---|---|---|
FCN-8s [14] | - | 45.6 | 32.3 | - |
ResNet50 [8] | - | 40.1 | 29.2 | - |
DWT [3] | - | 52.0 | 24.0 | - |
DSAC [15] | 60.0 | 58.0 | 27.2 | 25.4 |
Ours, single init | 57.9 | 52.2 | 23.9 | 29.0 |
Ours, multi init | 60.1 | 57.5 | 26.8 | 29.6 |

Vaihingen and Bing Huts: We compare against the baseline methods and DSAC [15]. In particular, the baseline exploits a fully convolutional network (FCN) [14] as the backbone to perform semantic segmentation of buildings (interior, boundary, and exterior) and then applies the post-processing of [15] to obtain the predicted contour. For fair comparison, we also replace the backbone of DSAC and the baseline with ours. We summarize results in Table 1. Compared to the strong FCN baselines, our method exhibits improved performance across the majority of metrics. In particular, the significant improvement on PolySim suggests our segmentations are more geometrically similar. Furthermore, our method significantly outperforms DSAC on all metrics. Even in instances where DSAC exhibits good coverage-based metrics, our method is significantly better at capturing edges. It is interesting to note that substituting our backbone in DSAC does not increase performance, while DSAC’s results generally exhibit higher variance, regardless of backbone.

TorontoCity: We compare against the semantic segmentation based methods that utilize FCN-8s [14] or ResNet50 [8], along with the instance segmentation method that relies on DWT. Additionally, because the commercial and industrial buildings contained in TorontoCity tend to possess complex boundaries, we examine the effects of using one versus multiple initializations described in Section 3.1. We summarize results in Table 2. Our method shows improved performance compared to DSAC on mIoU. For weighted coverage, which accounts for the area of the building, we achieve comparable performance to DSAC. Our performance on this metric is influenced primarily by larger buildings, as moving from a single initialization to multiple initializations significantly improves the performance. We find that the large areas covered by commercial and industrial lots present many ambiguous features such as mechanical penthouses and visible facades due to parallax error. Our single initialization method tends to produce smaller segmentations as it focuses on the boundaries offered by these features. This is alleviated by using multiple initializations. The weighting from larger buildings is also reflected in polygon similarity. The average BoundF metric demonstrates our method is significantly better at capturing building boundaries than DSAC. It is important to note that even our segmentations generated with a single initialization showed improved performance. To delve deeper into the quality of these segmentations, we examine their alignment with respect to the ground truth in Figure 6. We see that, for a given threshold for alignment error, our method exhibits superior recall. Overall, our multiple initialization scheme performs the best, although for lower error thresholds our single initialization is also very competitive.

Number of Initializations: In TorontoCity, we found that of examples required multiple initializations. For these examples, on average initializations were required.

Qualitative Discussion: We visualize some energies predicted by our CNN in Figure 7. We see the CNN opts to predict a $D$ term that has deep valleys at the building contours. The $\beta$ term adopts small values along the edges to encourage straightness at the boundaries, while the $\kappa$ term acts to propel the contour points from inside. We show additional segmentations in Figure 8. The segmentations produced by our method are generally more adherent to the edges of the buildings. This is especially helpful when buildings are densely packed together, as seen in the TorontoCity results (columns e-f). Additionally, in comparison to DSAC, our parameterization successfully prevents self-intersecting contours (column b, second last row).

Failure Modes: The last two rows in Figure 8 demonstrate some weaknesses of our model. In cases where the model is unsure about the extent of one building, it will expand the contour until it meets the edge of another building (column (b), last row). Also, on large commercial lots (column f, last two rows), our method becomes confused by the shapes and features of the buildings.

## 5 Conclusion

In this paper, we presented an approach to building instance segmentation using a combination of active rays with energies predicted by a CNN. The use of a polar representation of contours enables us to predict contours that cannot self-intersect, and to employ a loss function that encourages our predicted contours to match building boundaries. Furthermore, we demonstrate a method to combine several predictions to generate more complex contours. Comparisons against other state-of-the-art methods on various building segmentation datasets demonstrate our method’s power in generating segmentations that better capture, and are better aligned with, building boundaries.

## Acknowledgments

We gratefully acknowledge support from the Vector Institute, and NVIDIA for donation of GPUs. S.F. also acknowledges the Canada CIFAR AI Chair award at the Vector Institute. We thank Relu Patrascu and Priyank Thatte for infrastructure support.

## References

- [1] International society for photogrammetry and remote sensing, 2d semantic labeling contest. http://www2.isprs.org/commissions/comm3/wg4/semantic-labeling.html.
- [2] D. Acuna, H. Ling, A. Kar, and S. Fidler. Efficient interactive annotation of segmentation datasets with polygon-rnn++. In CVPR, pages 859–868, 2018.
- [3] M. Bai and R. Urtasun. Deep watershed transform for instance segmentation. In CVPR, pages 2858–2866. IEEE, 2017.
- [4] L. Castrejon, K. Kundu, R. Urtasun, and S. Fidler. Annotating object instances with a polygon-rnn. In CVPR, volume 1, page 2, 2017.
- [5] L. D. Cohen. On active contour models and balloons. CVGIP: Image understanding, 53(2):211–218, 1991.
- [6] J. Denzler, H. Niemann, et al. Active rays: Polar-transformed active contours for real-time contour tracking. Real-Time Imaging, 5(3):203–213, 1999.
- [7] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask r-cnn. In ICCV, pages 2980–2988, Oct 2017.
- [8] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
- [9] P. Hu, B. Shuai, J. Liu, and G. Wang. Deep level sets for salient object detection. In CVPR, volume 1, page 2, 2017.
- [10] M. Jaderberg, K. Simonyan, A. Zisserman, et al. Spatial transformer networks. In NIPS, pages 2017–2025, 2015.
- [11] M. Kass, A. Witkin, and D. Terzopoulos. Snakes: Active contour models. IJCV, 1(4):321–331, 1988.
- [12] H. Ling, J. Gao, A. Kar, W. Chen, and S. Fidler. Fast interactive object annotation with curve-gcn. In CVPR, 2019.
- [13] S. Liu, L. Qi, H. Qin, J. Shi, and J. Jia. Path aggregation network for instance segmentation. In CVPR, 2018.
- [14] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, pages 3431–3440, June 2015.
- [15] D. Marcos, D. Tuia, B. Kellenberger, L. Zhang, M. Bai, R. Liao, and R. Urtasun. Learning deep structured active contours end-to-end. In CVPR, pages 8877–8885, 2018.
- [16] S. Osher and J. A. Sethian. Fronts propagating with curvature-dependent speed: algorithms based on hamilton-jacobi formulations. Journal of Computational Physics, 79(1):12–49, 1988.
- [17] F. Perazzi, J. Pont-Tuset, B. McWilliams, L. Van Gool, M. Gross, and A. Sorkine-Hornung. A benchmark dataset and evaluation methodology for video object segmentation. In CVPR, 2016.
- [18] C. Rupprecht, E. Huaroc, M. Baust, and N. Navab. Deep active contours. arXiv preprint arXiv:1607.05074, 2016.
- [19] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge. IJCV, 115(3):211–252, 2015.
- [20] G. Sundaramoorthi, A. Yezzi, and A. C. Mennucci. Sobolev active contours. IJCV, 73(3):345–366, 2007.
- [21] I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun. Large margin methods for structured and interdependent output variables. JMLR, 6:1453–1484, 2005.
- [22] S. Wang, M. Bai, G. Mattyus, H. Chu, W. Luo, B. Yang, J. Liang, J. Cheverie, S. Fidler, and R. Urtasun. Torontocity: Seeing the world with a million eyes. In ICCV, pages 3028–3036, 2017.
- [23] Z. Wang, D. Acuna, H. Ling, A. Kar, and S. Fidler. Object instance annotation with deep extreme level set evolution. In CVPR, 2019.
- [24] F. Yu, V. Koltun, and T. Funkhouser. Dilated residual networks. In CVPR, 2017.
- [25] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia. Pyramid scene parsing network. In CVPR, pages 2881–2890, 2017.
