Tool uses Canny edge detection to identify and crop out objects in images
This work describes algorithms for performing discrete object detection, specifically in the case of buildings, where usually only low quality RGB-only geospatial reflective imagery is available. We utilize new candidate search and feature extraction techniques to reduce the problem to a machine learning (ML) classification task. Here we can harness the complex patterns of contrast features contained in training data to establish a model of buildings. We avoid costly sliding windows to generate candidates; instead we stitch together well known image processing techniques to produce candidates for building detection that cover 80-85% of buildings. Reducing the number of possible candidates is important due to the scale of the problem. Each candidate is subjected to classification which, although linear, costs time and prohibits large scale evaluation. We propose a linear time candidate alignment algorithm that boosts classification performance to 80-90% and show it has negligible cost. Also, we propose a new concept called a Permutable Haar Mesh (PHM) which we use to form and traverse a search space to recover candidate buildings which were lost in the initial preprocessing phase.
Rapid detection and classification of discrete objects such as buildings in geospatial imagery has many applications, such as damage assessment by comparing before and after building detections [Voigt et al., 2007] [Dong and Shan, 2013] [Brunner et al., 2010]. Large scale change detection at an object level can enable computer assisted updating of maps by identifying new or removed objects between multiyear satellite imagery [Bonnefon et al., 2002]. This could also allow for the next evolution of the USGS National Land Cover Database (NLCD) analysis [Xian et al., 2011]. Also, in the interest of national security, which is the funding motivation of this research, ontological analysis can be performed using the spatial arrangement of groups of buildings to identify large manufacturing, power generation, and weapons proliferation sites.
The usage of existing research is restricted because it requires camera alignment information (azimuth and zenith angles) and/or special equipment that captures near-infrared channels. Runtime is also a large factor which restricts the scale of deployment. In this work we present a combination of methods which have minimal imagery requirements (they work on common grayscale imagery) and provide scale and rotation invariant detection with relatively inexpensive computation.
The first contribution of this paper is that our method does not depend on sliding windows to generate building candidates (Section 2.1). Building candidates are rectangles, identified by a center, height, width, and rotation, that likely contain a building. If these were generated using a brute force sliding window approach, processing an image would be very expensive because the centers can be any pixel, the width and height can be any combination (non-overlapping), and the rotation can be any angle. We devise a linear time strategy utilizing building shadows as a major feature because they are high contrast, straight, ‘L’ shaped features unique to man made objects [Irvin and McKeown, 1989] [Lin and Nevatia, 1998] [Karantzalos and Paragios, 2009].
The second contribution is how we align buildings in linear time to increase classification accuracy (Section 2.2). We utilize a summation of Gaussians, each centered and scaled depending on the direction and magnitude of the vectors that form the contour around a building. We describe a linear time algorithm for computing this and show it has a negligible cost as well as a significant performance gain of up to 5% accuracy.
The third contribution is our candidate Permutable Haar Mesh (PHM) search method that heuristically searches nearby candidate boxes to find buildings via a greedy graph search algorithm (Section 2.4). We utilize Haar contrast features [Viola and Jones, 2004] for their strong performance. However, if a building candidate box does not properly cover the building, it will not be considered a building because its feature distributions won’t align to learned examples. The PHM approach is expensive and is not part of our rapid solution, but it can be employed to increase accuracy when necessary.
An overview of our method is shown in Figure 1. First (Figure 1a), Canny edge detection is run using a range of threshold values. Canny edge detection [Canny, 1986] is a fast, straightforward method that uses high and low thresholds to determine edges; using only one set of threshold values would not discover all buildings (discussed in Section 2.1). Instead, all combinations of threshold values on a grid with a fixed step size are used. The resulting binary images are processed for contours (Figure 1b) in linear time [Chang et al., 2004]. Each contour is considered a candidate. Contours are filtered out if they contain fewer than a minimum number of pixels needed for prediction, or if they are redundant, differing from another contour by less than 5 pixels.
These contours have their alignment (Figure 1c) detected automatically (Note: in the figure a different building is used to illustrate this). Section 2.2 discusses the rotation method. The candidates are then automatically rotated to a standard alignment, scaled to a standard size, and converted to grayscale for Haar feature extraction (Figure 1d). This rotation is so the Haar features will have more correlation when a model is built.
For every candidate, Haar features are extracted from fixed locations to capture contrast (Figure 1e). Haar features were shown to be rapid and robust by [Viola and Jones, 2004]. To extract a Haar feature, a rectangle is first overlaid at a specific and consistent location on the image. The rectangle is split in half, the pixels inside each half are summed, and one sum is subtracted from the other. The resulting value represents the contrast at that location in the image and can be compared to other images. Combinations of these features will be discriminative enough to build a model (Figure 1f). This model can then be used to make predictions when given unseen Haar feature values from a new test image.
To complement this method we present an optional step (optional due to its computational cost): a novel candidate permutation method called a Permutable Haar Mesh (PHM), which increases recall of candidates via greedy graph search (Section 2.4). Recall is an evaluation metric representing how many buildings have not been missed; it is complementary to precision, which represents how correct each prediction is. Candidates are surrounded by a bounding box and permuted by moving their top, bottom, left, and right boundaries in order to properly cover a candidate and capture buildings that would otherwise have been missed because the candidate didn’t properly cover the building.
We utilize building shadows as a major identifier of buildings because they are a high contrast feature which provides largely straight ‘L’ shaped contours unique to manmade objects [Irvin and McKeown, 1989] [Lin and Nevatia, 1998] [Karantzalos and Paragios, 2009]. Canny edge detection [Canny, 1986] is still the state of the art edge detection method that can capture these shadows well. The result of Canny edge detection is a binary image representing the edges of the input. Candidates are isolated by applying a linear time contour generation algorithm [Chang et al., 2004] which groups together edge pixels and returns a set of vectors that trace along these edges, forming a contour. Each contour is considered to be a candidate building. We also call the derived forms of this contour a candidate, such as a bounding box around the contour and the image pixels within this bounding box.
Canny edge detection has two hyperparameters, a high and a low threshold for hysteresis thresholding. Canny edge detection works by first computing the gradient of the image using the contrast between pixels (scaled between 0 and 1). Gradients below the low threshold are filtered out and will not be considered edges. Gradients above the high threshold are set as edges, and any remaining gradients that are connected to edges are set as edges. One combination of parameters will likely not return correct candidates for all buildings in an image, as shown in Figure 2, because too low a threshold can cause gradients of objects that neighbor a building to become part of its contour, while too high a threshold may cause the gradients of the building itself not to be considered. These issues are almost always present when buildings vary in size in the same image, because gaps in high gradients along the side of a building require lower thresholds, which will cause smaller buildings to be connected to neighboring objects.
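The hysteresis step described above can be sketched in a few lines. This is a minimal illustration only: it assumes the gradient magnitudes are already computed and scaled to [0, 1], it uses 8-connectivity (an assumption), and it omits the smoothing and non-maximum suppression stages of a full Canny detector. The function name `hysteresis` is hypothetical.

```python
from collections import deque

def hysteresis(grad, low, high):
    """Return a binary edge map from a 2D list of gradient magnitudes in [0, 1]."""
    rows, cols = len(grad), len(grad[0])
    edges = [[False] * cols for _ in range(rows)]
    queue = deque()
    # Strong pixels (above the high threshold) are edges outright.
    for r in range(rows):
        for c in range(cols):
            if grad[r][c] > high:
                edges[r][c] = True
                queue.append((r, c))
    # Remaining pixels at or above the low threshold become edges only
    # when they are connected (8-neighborhood, assumed) to an edge pixel.
    while queue:
        r, c = queue.popleft()
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                nr, nc = r + dr, c + dc
                if (0 <= nr < rows and 0 <= nc < cols
                        and not edges[nr][nc] and grad[nr][nc] >= low):
                    edges[nr][nc] = True
                    queue.append((nr, nc))
    return edges
```

Lowering `low` lets weak gradients bridge gaps along a wall, but the same change can also connect a small building to its neighbors, which is the tension described above.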
In order to be scale invariant, the union of the resulting contours from many different combinations of Canny threshold parameters is used to form the set of candidates. If the candidates generated in Figure 2 from the three different pairs of threshold values are merged together then all buildings will be included in the candidate set. However, as more threshold values are included, more non-buildings are included as well, which creates a challenge for later steps. Threshold values are chosen from a grid which is parametrized by a step size that controls the density of the grid. As the step size is decreased, more threshold values are included, which results in more candidates. Section 3.1 studies the trade-off when decreasing the step size in order to maximize precision and recall.
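A rough sketch of the sampling and merging idea follows. It is not the authors' implementation: the grid excludes the endpoints 0 and 1 and keeps only pairs with low < high, both of which are assumptions, and `detect` is a hypothetical callable standing in for edge detection followed by contour extraction.

```python
def threshold_pairs(step):
    """All (low, high) pairs on a grid over (0, 1) with low < high (assumed)."""
    n = round(1 / step)
    values = [round(i * step, 10) for i in range(1, n)]
    return [(lo, hi) for lo in values for hi in values if lo < hi]

def merged_candidates(image, step, detect):
    """Union of the candidates found at every threshold pair.

    `detect(image, low, high)` is a hypothetical callable that returns
    hashable candidates, for example bounding boxes as (x, y, w, h) tuples.
    """
    seen = set()
    for low, high in threshold_pairs(step):
        seen.update(detect(image, low, high))
    return seen
```

Under these assumptions a step of 0.2 gives 6 threshold pairs and a step of 0.05 gives 171, which illustrates why the number of candidates grows quickly as the step size shrinks.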
Figure 2: The application of Canny edge detection (center) and contour detection (right) at various threshold values to generate candidates. Red dashed boxes on the left show candidates that enclose buildings and green check marks are candidates that will be classified as buildings. As the high threshold parameter to the Canny edge detector is varied from 0.1 at the top to 0.9 at the bottom, different contours are generated. There are no perfect parameters that generate correct candidates for both buildings.
Contours resulting from Chang’s contour detection [Chang et al., 2004] are represented by a set of vectors. From these vectors we want to determine the aggregate direction of the object they represent. By rotating these candidates into alignment before extraction of the Haar features, the features become more discriminative and will result in an increase in accuracy of the trained classifier (explained in §2.3).
Determining the aggregate direction is difficult because buildings may not have their walls parallel to each other and the edge and contour detection methods may have introduced noise in the vector directions. Consider the simple example in Figure 3: suppose we have a contour made up of four vectors which would appear to be a rectangle with the longest side as the dominant edge. If the assumption is made that the majority of the walls’ length will point in the dominant direction of the building, then we should be able to sum the vectors with the same angle to determine which angle the majority of the walls align with. However, taking the sum for each direction would not capture the similarity of two walls whose angles differ only slightly. They would be considered independent, and Eq. 1 would then select a different angle as the dominant direction of the building, which is false.
We need to tolerate this noisy data and take these situations into account because contours can be even more complex and misleading, as seen in Figure 4.
To determine the alignment direction we evaluate the summation of Gaussians for each candidate angle in degrees. Algorithm 1 formalizes this method. For each contour segment the angle is determined using the arctangent. Each Gaussian is weighted by its segment’s magnitude relative to the sum of all magnitudes. The maximum is then found by iterating over 180 possible angles. Figure 4 shows this method not only handles the specific issue we discussed of non parallel walls but also tolerates noise in the contour data. Noise here means jitter in the angle of the vectors as they wrap around the building. This can be due to pixelation error during image capture, contours containing vectors that don’t overlap the building walls, or non-rectangular building shapes. This rotation method not only increases classification accuracy but does so with a negligible increase in time (shown in the runtime evaluation).
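The alignment step can be illustrated with the following sketch. It is a simplified reading of the description rather than a reproduction of Algorithm 1: the Gaussian width `sigma` is an assumed value (the text does not state one), directions are folded into [0, 180) so that opposite walls count as the same direction, and the function name `dominant_angle` is hypothetical.

```python
import math

def dominant_angle(vectors, sigma=5.0):
    """Return the angle in [0, 180) degrees that best aligns with the contour.

    `vectors` is a list of (dx, dy) contour segments. `sigma` (degrees) is an
    assumed Gaussian width; the source does not state the value used.
    """
    segments = []
    total = 0.0
    for dx, dy in vectors:
        mag = math.hypot(dx, dy)
        if mag == 0:
            continue
        # Direction folded into [0, 180): opposite walls share one direction.
        theta = math.degrees(math.atan2(dy, dx)) % 180.0
        segments.append((theta, mag))
        total += mag
    if not segments:
        raise ValueError("contour has no non-zero segments")

    best_angle, best_score = 0, -1.0
    for angle in range(180):
        score = 0.0
        for theta, mag in segments:
            # Circular distance on a 180-degree period.
            d = abs(angle - theta)
            d = min(d, 180.0 - d)
            # Gaussian centered on the segment angle, weighted by its
            # share of the total contour length.
            score += (mag / total) * math.exp(-(d * d) / (2 * sigma * sigma))
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle
```

Because each segment contributes a smooth bump rather than a vote for a single exact angle, two long walls that differ by a degree or two reinforce each other instead of being counted independently.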
To build a classification model that can filter candidates into building and non-building, we need features that can discriminate effectively and are efficiently computed. Haar features have been shown to quickly capture discriminative contrast patterns [Viola and Jones, 2004]. They are generated by taking a rectangular image mask and dividing it into two rectangles, with a horizontal or vertical division. The sum of the pixel values in one rectangle is subtracted from the sum of the pixel values in the other. Haar features are discriminative in face and crater detection [Cohen and Ding, 2013] because these domains have similar contrast at specific positions of the candidates. In this work each candidate is scaled to 200 x 200 pixels before Haar features are extracted. Horizontal and vertical Haar features are extracted in a sliding window fashion which extracts square regions from the image systematically from the top left to the bottom right. Square regions with pixel widths of 40, 80, and 100 are extracted with a step size of 10 pixels. Also, square regions with a width of 20 are extracted with a step size of 5 pixels in order to capture small details. This yields a total of 3592 features. Each feature represents the horizontal or vertical contrast in that region with a signed integer value. A value of 0 means no contrast, whereas a positive or negative value represents contrast in the positive or negative direction. The sign of the number depends on the order of the subtraction during extraction and is only useful for comparison.
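The feature layout can be made concrete with a short sketch. The window enumeration below is an assumption: it stops one step short of the image border (an exclusive range end), which is one layout that reproduces the stated total of 3592 features; the source does not spell out the exact offsets. The subtraction order (left minus right, top minus bottom) and all function names are likewise illustrative.

```python
def haar_windows(size=200):
    """Enumerate (x, y, width, orientation) for every two-rectangle feature."""
    windows = []
    for width, step in ((40, 10), (80, 10), (100, 10), (20, 5)):
        # Assumption: offsets stop one step short of the border
        # (exclusive range end); this is what reproduces 3592.
        for y in range(0, size - width, step):
            for x in range(0, size - width, step):
                windows.append((x, y, width, "horizontal"))
                windows.append((x, y, width, "vertical"))
    return windows

def integral_image(img):
    """Summed-area table with an extra zero row and column."""
    rows, cols = len(img), len(img[0])
    ii = [[0] * (cols + 1) for _ in range(rows + 1)]
    for r in range(rows):
        row_sum = 0
        for c in range(cols):
            row_sum += img[r][c]
            ii[r + 1][c + 1] = ii[r][c + 1] + row_sum
    return ii

def rect_sum(ii, x, y, w, h):
    """Sum of pixels in the rectangle with top-left (x, y), via four lookups."""
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]

def haar_feature(ii, x, y, width, orientation):
    """Signed contrast between the two halves of a square region."""
    half = width // 2
    if orientation == "horizontal":  # left half minus right half (assumed order)
        return (rect_sum(ii, x, y, half, width)
                - rect_sum(ii, x + half, y, half, width))
    # vertical: top half minus bottom half (assumed order)
    return (rect_sum(ii, x, y, width, half)
            - rect_sum(ii, x, y + half, width, half))
```

A feature vector for one candidate would then be `[haar_feature(ii, *w) for w in haar_windows()]`, computed after a single pass that builds the integral image.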
By aligning buildings and adding padding to expose their edges, which have high contrast, we are able to obtain consistent contrast patterns between candidates. For example, the Haar features being extracted in Figure 5a will statistically expose higher contrast in candidates which contain buildings due to the edges of these buildings appearing in the same location across examples. Also, roof texture and the surrounding area texture may be consistent enough to provide linearly separable distributions of values with respect to building and non-building. In order to gain more insight we analyze the top weighted Haar features in the Linear AdaBoost classifier in Figure 5b, where it can be seen that edges of buildings are very discriminative. We are able to conclude that the statements from previous work that find shadows to be a dominant feature are correct. Shadows will generally exist at the edges of buildings and provide strong contrast values at the edge of the roof where the shadow begins. Together, many of these features allow us to obtain a linearly separable feature space to achieve accurate classification. One problem that arises from using these features is that when buildings have black roofs, the contrast between the roof and the shadow is very low and the building might appear very similar to a solid surface.
Some candidates are lost during the initial preprocessing step due to contours that cover only part (or too much) of the building, as shown in Figure 6. This leads to a misalignment of Haar features.
To solve this problem we present a Permutable Haar Mesh (PHM) algorithm which iteratively permutes the building candidate using a custom heuristic function to search the space shown in Figure 7. We perform a multi objective greedy search (for speed) using the following function (for accuracy) based on the result of a classifier:

$$h(c) = \frac{2 \, P(\text{building} \mid c) \, \bigl(1 - P(\text{not building} \mid c)\bigr)}{P(\text{building} \mid c) + \bigl(1 - P(\text{not building} \mid c)\bigr)}$$

Here we take the harmonic mean of $P(\text{building} \mid c)$, the probability that candidate $c$ is a building, and $1 - P(\text{not building} \mid c)$, the complement of the probability that $c$ is not a building. Using a greedy search we evaluate each permutation and select the best increase in probability at each step of the iteration until we cannot improve the hypothesis probability. This method is outlined in Algorithm 2.
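The greedy search can be sketched as a hill climb over the four box boundaries. This is an illustrative reading, not Algorithm 2 itself: the move size `step`, the iteration cap `max_iters`, and the function names are assumptions, and `score` is a hypothetical callable that would wrap the classifier, for example as `harmonic_mean(p_building(box), 1 - p_not_building(box))`.

```python
def harmonic_mean(a, b):
    """Harmonic mean of two non-negative values (0 when both are 0)."""
    return 0.0 if a + b == 0 else 2 * a * b / (a + b)

def phm_search(box, score, step=2, max_iters=100):
    """Greedily move box edges while the heuristic score improves.

    `box` is (top, bottom, left, right). `score(box)` is a hypothetical
    callable returning the heuristic value for a box. `step` (pixels) and
    `max_iters` are illustrative values, not taken from the source.
    """
    best, best_score = box, score(box)
    for _ in range(max_iters):
        neighbors = []
        for i in range(4):                 # one boundary at a time
            for delta in (-step, step):
                cand = list(best)
                cand[i] += delta
                top, bottom, left, right = cand
                if top < bottom and left < right:   # keep the box valid
                    neighbors.append(tuple(cand))
        scored = [(score(n), n) for n in neighbors]
        if not scored:
            break
        top_score, top_box = max(scored, key=lambda s: s[0])
        if top_score <= best_score:
            break                          # no permutation improves the hypothesis
        best, best_score = top_box, top_score
    return best, best_score
```

Each iteration scores up to eight neighboring boxes, so the cost per starting candidate is a multiple of the cost of a single classification, which is why this step is kept optional.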
Our method is linear time for generating candidates, which places the training complexity on the classifier used. Each candidate generated as a negative example adds to the complexity. This can be reduced by generating fewer negative examples, but this may also produce a classifier with lower performance.
When utilizing the classifier our method is linear in terms of pixels or candidates. In the worst case every pixel could be considered a candidate, which would be determined in linear time using Canny edge detection and Chang’s linear contour detection. When sampling and merging using a specific step we incur a fixed cost dependent on the step size chosen, which multiplies the number of potential candidates to evaluate by a constant. Our rotation method is based on the number of vectors in the contour of the candidate. The maximum number of vectors in a contour would be the number of pixels in the candidate. Our approximation method solves this in linear time. Each candidate then has a fixed number of Haar features extracted, which costs one initial pass over the candidate’s pixels to build an integral image and then 4 additions per Haar feature. When using a linear classification model, such as Naive Bayes or AdaBoost on linear decision stump classifiers, each candidate can then be classified in linear time.
In order to evaluate our method we looked for publicly available datasets that would allow us to study the errors when applying methods to the average residential buildings as well as unique industrial buildings. [Mnih and Hinton, 2010] has generated a benchmark dataset using MassGIS which contains average residential buildings but industrial buildings such as coal and nuclear power plants are not released by MassGIS. Because of this we have built a dataset of nuclear power plant buildings that can be shared with the research community. We utilize these two datasets in order to showcase the robustness of our algorithm on imagery with various quality and content.
Dataset A (Figure 8a) was constructed using images from Google Maps (https://maps.google.com/) with various resolution, size, illumination, geographic region, building size, and building purpose. There are 411 buildings in this dataset which are mostly non-residential, including large industrial and power generation buildings. These buildings can be unique to a specific purpose and vary greatly in size.
Dataset B (Figure 8b) is a labelled subset of the dataset used in [Mnih and Hinton, 2010] (http://www.cs.toronto.edu/vmnih/data/). We used a higher resolution (15cm/pixel) version of the same imagery acquired from MassGIS (coq2008_15cm_jp2). All buildings have the same illumination. This dataset is of a contiguous area composed of mostly residential buildings. In total there are 1337 buildings.
We use these datasets to first evaluate the recall obtained by our method. Recall is an evaluation metric representing how many buildings have not been missed; it is complementary to precision, which represents how correct each prediction is. After this we discuss how our positive and negative examples are constructed to train a classifier. This is followed by an analysis of the effect of candidate alignment on these examples for various classifiers. We then discuss how our PHM method can recover candidates to increase recall and achieve better accuracy at the cost of a more computationally expensive method. Finally we evaluate the runtime of different components of our algorithm.
It is important that we achieve high recall in order to not miss any potential buildings using our candidate generation method. Unfortunately there are some complications that we had to overcome. Using a single high and low Canny threshold value we are only able to achieve low recall values. In Figure 9 we explore all possible configurations of low and high threshold values on dataset A. These results show an irregular surface due to a trade-off between capturing different sizes of buildings. There does seem to be a peak but it is very low. Some buildings are only identified as candidates at specific threshold values, so changing them misses some while finding others. The problem is that these values are not the same for every building in a dataset, as shown in Figure 2. This observation leads us to our solution, because different buildings are only captured by different threshold values.
To solve this problem we generate candidates by sampling and merging the results of candidate generation at many different threshold values. The question now is which low/high threshold values to use. We experiment with various step sizes through the space (0,0) to (1,1) in Figure 10. As the step size is reduced from 0.2 to 0.01 the recall increases at a diminishing rate. However, there is a trade-off that must be made when choosing a small step size. In Figure 11 the total number of candidates that must be evaluated is analyzed. As the step size is reduced, the total number of candidates increases to numbers that are much larger than the number of buildings that exist in those images. This may not only increase running time but also decrease overall performance by increasing the chance that a classifier may misclassify.
To put more context on Figure 10, in dataset A we start with 411 labeled buildings and our preprocessing step is able to find 86% when generating about 90,000 candidates. In dataset B we start with 1,337 labeled buildings and our preprocessing step is able to find 80% when generating about 240,000 candidates. To put this in perspective, without this preprocessing step a sliding window would need over 1 billion candidates for a single small 1,000 x 1,000 image in order to achieve 100% recall, because the centers can be any pixel, the width and height can be any combination (non-overlapping), and the rotation can be any angle.
To learn an accurate classifier requires constructing a training set containing difficult realistic examples of what will be presented to the classifier during testing. We run the candidate generation process and subtract the positive examples. This process includes candidates that partially overlap the ground truth in order to train on examples that may be misclassified during testing. Our goal is to select strong representative examples that we expect to reside near the decision boundary of a classifier.
For all the evaluations following this section, 10-fold cross validation is used to calculate the F1-Score obtainable with a classifier. We define the F1-Score as follows:

$$F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$
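The F1-Score is the harmonic mean of precision and recall. As a small worked illustration, the sketch below computes it from raw counts of true positives, false positives, and false negatives; the function name and the choice to return 0 when a denominator is empty are assumptions made for this example.

```python
def f1_score(tp, fp, fn):
    """F1 from counts of true positives, false positives and false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For instance, 8 correct detections with 2 false alarms and 2 missed buildings give a precision and recall of 0.8 each, and therefore an F1-Score of 0.8.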
Dataset A has 383 positive and 4,912 negative examples. Dataset B has 992 positive and 11,488 negative examples. The number of positive examples is less than the total ground truth number because candidates whose 5% padding extends beyond the image bounds are excluded. The datasets are balanced in order for the classifiers to properly learn. This is done by randomly sampling with replacement to add duplicates to the positive examples.
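The balancing step can be sketched as simple oversampling. The function name and the fixed `seed` are assumptions made so the example is reproducible; the source only states that positives are sampled with replacement until the classes are balanced.

```python
import random

def balance_by_oversampling(positives, negatives, seed=0):
    """Duplicate randomly chosen positives until both classes have equal size."""
    rng = random.Random(seed)
    deficit = max(0, len(negatives) - len(positives))
    # Sampling with replacement: the same positive may be drawn many times.
    extra = [rng.choice(positives) for _ in range(deficit)]
    return positives + extra, negatives
```

With the counts reported for Dataset A, this would add 4,529 duplicated positives to the 383 originals to match the 4,912 negatives.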
All experiments are performed with the AdaBoost classifier unless otherwise noted. In the next section we compare many different classifiers. The Weka implementations of these algorithms are used with their default values.
Naive Bayes assumes all variables are conditionally independent with respect to the class label. This classifier then simply uses Bayes’ rule to determine the probability of a class attribute given feature values [John and Langley, 1995].
Random Forest constructs decision trees from subsets of features which are drawn uniformly with replacement from the global feature set. 100 trees are constructed. Each decision tree is constructed similar to J48. The resulting classification is a majority vote by all trees for a class label [Breiman, 2001].
Analysis is performed to evaluate the effect of rotating candidates on the overall pipeline. To demonstrate the versatility of this step we evaluate many classifiers. In Figure 12 it can be observed that rotating candidates increases the F1-Score of standard classification algorithms.
To evaluate the following classification methods we generate candidates from each training set using the sampling and merging method with step size 0.05 and form an isolated set of candidate images so that 10-fold cross validation can easily be performed. The results here are the metrics from these isolated sets and therefore don’t reflect the impact of recall loss from the preprocessing method, which is analyzed in Section 3.1.
We evaluate AdaBoost because it was used as part of the Viola and Jones face detection pipeline [Viola and Jones, 2004]. AdaBoost was expected to be well suited for this task because it performs feature selection on the many Haar features generated from the candidate in both situations. This is, however, not the case: AdaBoost ranks among the worst classifiers evaluated.
We evaluate Naive Bayes and J48 Decision Tree classification models as baselines because they are quick to train and we expect the reader will be familiar with them. A random classifier was used to confirm a 50% F1-Score, indicating balanced training data. We also evaluate Random Forest and find it to outperform all other methods.
The classification models discussed so far can be trained and utilized rapidly in comparison to a Support Vector Machine (SVM) with a non-linear kernel. We were able to evaluate Dataset A using an SVM with a radial basis function kernel. However, we were unable to evaluate Dataset B using an SVM because the algorithm did not terminate within 72 hours. It is interesting how poorly the SVM model performs. We speculate that this may be caused by noisy or irrelevant Haar features. A large number of features may cause the classifier to weight features inappropriately and skew classification. The increase in performance after candidate rotation may indicate this, as rotation gives features a higher discriminative ability, making them easier to separate.
Overall, every classification method had its F1-Score increase after the alignment of candidates. The most significant increase was for an SVM classifier.
The primary goal of our preprocessing method is to maintain high recall. If candidates are still missed we can use the PHM method to salvage over/underdetected candidates as outlined in Section 2.4. This method is analyzed in Figure 13 to study how the F1-Score is impacted as the permutation rate is increased. For these experiments we used one combination of high and low Canny threshold values instead of merging many values together which yields lower recall values from the start.
In Figure 13 as the rate of permutation increases so does the recall. However, similarly as the permutation rate increases the precision falls. The increase in precision error is due to more candidates being presented to the classifier which appear to be buildings as a result of the PHM process itself. A compromise is found at the peak of the F1-Score plot of 0.01.
Our machine learning pipeline runs in linear time as theoretically explained in Section 2.5. We empirically evaluate the runtime on a single 3.07GHz Intel Xeon CPU. However, many parts of the algorithm are easily made parallel to achieve major speed improvements.
The first way to empirically show this is during the initial contour extraction phase, analyzed in Figure 14a. Here images are processed one after another, and the total number of pixels processed is plotted against the time taken. It is observed that aligning the contours only slightly increases the processing cost.
In Figure 14b we perform the same evaluation but allow the process to continue to the step of extracting Haar features from every candidate. A surprising result is that it takes less time when we add the rotation step. One explanation may be that the scaling phase before Haar features are extracted is sped up because images contain fewer edges on diagonals.
In Figure 15 we evaluate the entire pipeline and observe that our basic machine learning (ML) approach appears significantly faster than PHM. For every candidate encountered during the algorithm, the PHM may search hundreds of surrounding candidates to find a better match. From our experience the machine learning approach appears to work in almost real time on reasonably sized images.
Automated labeling of aerial images has been a motivating problem for researchers for a very long time [Irvin and McKeown, 1989]. The development of an automated system to identify discrete objects, such as buildings, has been a much sought after goal. Many techniques from the field of computer vision have been employed, as well as statistical machine learning approaches. A number of surveys including [Mayer, 1999], [Baltsavias, 2004], and [Drăguţ and Blaschke, 2006] indicate the depth of this field.
Unlike our method, which relies only on RGB images, much work has been done using very high spatial resolution (VHR) multispectral data [Sohn and Dowman, 2007], synthetic aperture radar (SAR) data [Simonetto et al., 2005], and light detection and ranging (LIDAR). This information has been used to filter out sections of images corresponding to non-building areas such as vegetation or water. Information such as azimuth and zenith angles has been used to calculate shadow locations, and near infrared has been used to better distinguish building shadows from plant shadows [Ok, 2013].
Working only with images, other researchers have explored techniques using many different types of features that can capture texture information, color, shape, and contextual information. Simple features can be built using the color and intensity of pixels, and gradient based features have also been used. Local scale and rotation invariant features like Lowe’s SIFT [Lowe, 2004] and the sped up version SURF [Bay et al., 2006] have been evaluated [Yang and Newsam, 2013] [Sirmacek and Unsalan, 2011].
Shadows have been identified as a powerful building indicator that can be detected by simple algorithms similar to ours [Irvin and McKeown, 1989] [Wei et al., 2004]. Machine learning has been employed extensively, with various systems using features to train classifiers such as Support Vector Machines [Mountrakis et al., 2011] [Mnih and Hinton, 2010].
Our method stands out from these other approaches because of our focus on speed and applicability to all geospatial imagery: it only needs pure RGB images and does not require a near-infrared channel or azimuth and zenith angles. Also, unlike other methods, we provide an implementation of our method.
In this paper we describe algorithms for reducing discrete object detection in reflective geospatial imagery to machine learning, specifically in the case of buildings. Results from the application of this method are shown in Figure 16. We have shown the complex patterns of a discrete object’s contrast features can be learned using state of the art ML methods. The reduction requires non-trivial ML-aware preprocessing methods. We have shown that these methods consistently increase the performance of classification algorithms. We also present the concept of a PHM in order to recover candidates that fail to be classified correctly. This method generates a search space which has potential to greatly increase detection rates and requires further research to fully utilize beyond what is explored in this paper.
This work is partially funded by a grant from the National Nuclear Security Agency of the U.S. Department of Energy (grant number: DE-NA0001123) as well as by the National Science Foundation Graduate Research Fellowship Program (grant number: DGE-1356104). This work utilized the supercomputing facilities managed by the Research Computing Department at the University of Massachusetts Boston as well as the resources provided by the Open Science Grid, which is supported by the National Science Foundation and the U.S. Department of Energy’s Office of Science.
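[John and Langley, 1995] John, G. H. and Langley, P. (1995). Estimating continuous distributions in Bayesian classifiers. In Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, UAI’95, pages 338–345, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.
[Yang and Newsam, 2013] Yang, Y. and Newsam, S. (2013). Geographic image retrieval using local invariant features. Geoscience and Remote Sensing, IEEE Transactions on, 51(2):818–832.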